Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

Under review as a conference paper at COLM 2024

Efficient Prompting Methods for Large Language Models:


A Survey

Kaiyan Chang1 , Songcheng Xu1 , Chenglong Wang1 , Yingfeng Luo1 ,


Tong Xiao1,2 ∗, Jingbo Zhu1,2
1
School of Computer Science and Engineering, Northeastern University, Shenyang, China
2
NiuTrans Research, Shenyang, China
changkaiyan neu@outlook.com, winsome.xsc@gmail.com
{xiaotong, zhujingbo}@mail.neu.edu.cn
arXiv:2404.01077v1 [cs.CL] 1 Apr 2024

Abstract

Prompting has become a mainstream paradigm for adapting large language models
(LLMs) to specific natural language processing tasks. While this approach opens
the door to in-context learning of LLMs, it brings the additional computational
burden of model inference and human effort of manual-designed prompts, particu-
larly when using lengthy and complex prompts to guide and control the behavior
of LLMs. As a result, the LLM field has seen a remarkable surge in efficient
prompting methods. In this paper, we present a comprehensive overview of these
methods. At a high level, efficient prompting methods can broadly be categorized
into two approaches: prompting with efficient computation and prompting with
efficient design. The former involves various ways of compressing prompts, and
the latter employs techniques for automatic prompt optimization. We present the
basic concepts of prompting, review the advances for efficient prompting, and
highlight future research directions.

1 Introduction

Large Language Models (LLMs) have significantly advanced state-of-the-art on various natural
language processing (NLP) tasks, such as dialogue, machine translation, and summarization (Brown
et al., 2020; Touvron et al., 2023; Bubeck et al., 2023). Prompting is an important medium for
human-computer interaction to explicitly deliver clear task descriptions to LLMs, which then generate
user-desired responses through analogical learning. The content of a prompt can vary across different
contexts, specifically containing instructions, questions, multiple demonstrations with specific output
formats, and additional requirements like complex reasoning processes and role-playing commands.
In this paper, the term prompt refers to the user input to LLMs.
However, as the in-context learning (ICL) (Dong et al., 2022) ability of LLMs becomes more
powerful, prompts designed for different specific tasks tend to be diverse and detailed. Ultra-long
natural language prompts gradually raise two issues: 1) for LLM itself, the context window is limited,
affecting its potential to handle excessively lengthy contexts; 2) for LLM users, it requires either
substantial computational resources to train open-source models or high costs to call closed-source
model interfaces. From this point of view, the cost of LLM utilization is quite huge in both academic
research and commercial deployment scenarios. It’s obviously a shame that LLM with excellent
performance cannot be widely used. While there are many relevant improvements for model structure,
such as efficient attention (see (Xiao & Zhu, 2023; Wan et al., 2023) for related work) that can
effectively mitigate inference costs, in this paper, we focus more on efficient prompting methods to
save unnecessary financial overheads.
Considering financial and human resources, efficiency can be improved from three perspectives: 1)
inference acceleration, 2) memory consumption decline, and 3) automatically well-designed prompts.
The first two objectives can be achieved by prompt compression, while the third objective can be
attained by automatic prompt optimization based on prompt engineering rather than manual design.

Corresponding author.

1
Under review as a conference paper at COLM 2024

Generative Application Downstream


Models Paradigms Tasks

Supervised Natural Learnable


Single Language Vectors
Learning
NL
Fine-tune
Pre-train NLU/NLG Compressor


input
input
… label
label
LM
NL

Prompt
Large NLU+NLG
NL
… NL

Figure 1: NLP paradigm shift Figure 2: Hard & soft prompts

To the best of our knowledge, there is a notable gap in the literature regarding a comprehensive
consolidation of efficient prompting methods.
In this survey, we commence with an introduction to the background of prompting in Section 2.
Subsequently, we examine existing efficient prompting methods in aspects of Computation (Sec.3)
and Design (Sec.4). The former organizes prompt compression into three categories: knowledge
distillation (Sec.3.1), encoding (Sec.3.2), and filtering (Sec.3.3). The latter explores automatic prompt
optimization based on traditional gradient-descent (Sec.4.1) and intelligent evolutionary algorithms
(Sec.4.2). In particular, we abstract the efficient prompting as a multi-objective optimization problem
and look forward to future directions from a theoretical perspective (Sec.5). Finally, we summarize
the full text in Section 6. Additionally, we include a list of open-source projects A.2 for convenient
reference and a typology diagram A.3 of efficient prompting methods.

2 Overview
2.1 Prompt paradigm

The emergence of prompting is closely associated with the evolution of Pre-trained Language Models
(PLMs) and the advancement of Large Language Models (LLMs).
PLM Evolution The evolutionary trajectory of the PLM paradigm has transitioned from effective-
ness to efficiency. Since the Transformer (Vaswani et al., 2017) was proposed, it has become the
foundational architecture of a wide range of PLMs. The self-attention mechanism within Transformer
has been proven effective in solving long sequence problems. To tackle basic Natural Language
Understanding (NLU) and Natural Language Generation (NLG) tasks respectively, mainstream PLMs
have gradually evolved into the BERT (Devlin et al., 2019) and GPT (Radford et al., 2018) series
models. There are numerous optimization strategies such as exploring encoding methods (Su et al.,
2021), improving self-attention mechanism (Roy et al., 2021) and refining model structures (Li et al.,
2021) to achieve efficient performance of PLMs in solving specific tasks.
NLP Paradigm Shift The NLP training paradigm has witnessed two pivotal shifts (Liu et al., 2023b),
evolving from “fully supervised learning” to “pre-train and fine-tune”, and eventually to “pre-train,
prompt, and predict” (as illustrated in Figure 1). In this survey, we will concentrate on the most
extensively adopted prompting paradigm, delving into its recent developments. Notably, GPT-3
(Brown et al., 2020) played a seminal role in introducing the hard prompt, enabling humans to use
natural language to interact with language models. This breakthrough was made possible by large-
scale parameters, which empower GPT-3 with a deep understanding of natural language, thus allowing
it to leverage complex hard prompts for few-shot learning without the necessity for fine-tuning.
LLM Advancement After GPT-3 pioneering the LLM era, ChatGPT stands out as a significant
milestone in shaping the prevailing mainstream paradigm of “LLM + prompting”. Its perfect
integration of NLU and NLG capabilities has enthralled the entire Artificial Intelligence community.
As scaling laws (Wei et al., 2022a) demonstrate remarkable emergent abilities (e.g., instruction
following, in-context learning, and complex reasoning), researchers have persistently explored the

2
Under review as a conference paper at COLM 2024

Template Instruction Complex


Translate English Assuming you are a mathematician, help me solve the following math problem.
to French: +Demonstrations
sea otter => loutre Q:Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How
de mer much did she earn?
A:10
cheese => {Q:…A:…}
Q:James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a
year?
A:
+Chain-of-Thoughts
Q:Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How
much did she earn?
A:Weng earns 12/60 = 0.2 per minute. Working 50 minutes, she earned 0.2 x 50 = 10. ### 10.
{Q:…A:…}
Q:James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a
year?
Concise A:

Figure 3: More detailed hard prompts, where the highlighting indicates additional reasoning process

performance boundaries of prompting whether open- or closed-source LLMs. For instance, complex
prompts like Chain-of-Thought (CoT) (Wei et al., 2022b) amplify the potential reasoning ability of
LLMs by thinking aloud. While the prompting paradigm gradually solidifies its place, the LLM still
faces challenges in terms of computational and human resources due to its large-scale parameters.
Thus, effective prompting methods to conserve resources have attracted broad interest.

2.2 Types of prompts

In essence, the primary objective of prompting is to achieve effective few-shot learning in place of
the unnecessary resource consumption of full parameter fine-tuning. Prompt expressions can be
categorized into two main types, as illustrated in Figure 2: discrete natural language prompts (referred
to as hard prompts) and continuous learnable vectors (referred to as soft prompts).

2.2.1 Hard prompt


Hard prompts are particularly well-suited for generative language models, notably exemplified by the
GPT series of models. The reasons why hard prompts deserve particular attention are two-fold. From
a positive standpoint, due to the vast amount of pre-training data integrated into LLMs, humans can
easily interact with the world knowledge compressor (i.e. LLM) through native language, ultimately
obtaining helpful responses. From a negative standpoint, due to the widespread closed-source nature
of current LLMs, making their parameter weights inaccessible, users have no choice but to utilize hard
prompts with LLMs through API calls. Nevertheless, LLM’s strong instruction-following capability
lays a solid foundation for the development of hard prompts, and natural language as a medium for
seamless human-computer interaction is just around the corner.
It is important to highlight the diverse variations among hard prompts. Initially, hard prompts
comprised concise task instructions resembling templates designed for the Cloze task. However,
as the comprehension capabilities of LLMs continue to advance, hard prompts have evolved to
incorporate a broader array of elements, most commonly including demonstrations and Chain-of-
Thoughts as depicted in Figure 3. The increasing interest in hard prompts within the current NLP
community, where even tutorials for unlocking the full potential of LLMs, indicates the desire for
human-model alignment leading to Artificial General Intelligence (AGI).

2.2.2 Soft prompt


In the early stages of prompt-related research, soft prompts were manifested in forms such as adapters
(Houlsby et al., 2019), prefixes (Li & Liang, 2021), and even unexplainable vectors. Numerous studies
(Lester et al., 2021; Liu et al., 2022) have investigated the benefits of soft prompts in enhancing

3
Under review as a conference paper at COLM 2024

efficient training by exploring different embedding positions. The standard methods involve freezing
the original model parameters and solely training soft prompts to achieve the effects of complete
parameter fine-tuning. There is a more detailed introduction in Ding et al. (2022)’s work.
Given that learnable vectors can be updated along with neural network parameters, soft prompts are
evidently more conducive for LLMs to comprehend prompts effectively. It is essential to note that
the soft prompts discussed in this paper are simply vector representations of hard prompts for LLMs
as shown in Figure 2, rather than abstract vectors developed from scratch. Some endeavors have
involved compressing longer hard prompts into significantly shorter soft prompts (Refer to Sections
3.1 and 3.2 for detailed insights).

2.3 Challenges

In light of the fact that hard prompts have been widely recognized and applied in various downstream
tasks. Prompts are designed with greater detail to enhance task accuracy, consequently leading
to more lengthy and complex prompts. In this survey, we present two key challenges facing hard
prompts from an efficiency standpoint:
Lengthy Prompt Content The length of the prompt typically varies depending on the specific tasks,
the more demonstrations, the better the performance. For instance, the Chain-of-Thought (CoT)
prompt has significantly enhanced the logical reasoning ability of LLMs, leading to the emergence of
various CoT-based methods. Like Self-Ask (Press et al., 2022) and Least-to-Most prompt (Zhou et al.,
2022a) assist LLMs in breaking down complex questions into simpler subquestions for step-by-step
answers. Wang et al. (2022) sample a diverse set of reasoning paths and Wang et al. (2023b) guide
LLMs to generate correct PS (Plan and Solution), and then select the final answer. However, the
advantages of using such complex prompts are accompanied by higher financial burdens, along with
reduced information perception capabilities of LLMs.
Difficult Prompt Design Due to the discrete nature of natural language, early usable hard prompts
are usually manual-designed and then obtained through repeated trial and error. Hand-crafted prompt
templates rely heavily on empirical knowledge and involve obvious human subjectivity. But there
is a difference between human problem-solving approaches and those of neural networks, in other
words, the interpretability of LLMs is still a topic of continuous exploration, with no recognized
theoretical guidance at present. As a result, there are many challenges in prompt design for LLMs,
including the high sensitivity of LLMs to natural language prompt formats, the large performance
gap of semantically similar prompts, the correlation between prompt complexity and task difficulty,
and the model- and task-specific attributes of the prompts. So, it is time-consuming and laborious to
manually design prompts of high quality in the face of different models and different tasks.
In conclusion, prompting effectively mitigates the parameter redundancy problems when applied to
downstream tasks, thereby saving financial resources. However, in the era of LLMs, the increasing
length of prompts has brought challenges such as larger memory requirement, slower inference speed,
and heightened labor intensity, which deviates from the original purpose of prompting. Therefore,
this survey delves into the efficient prompting methods currently employed in LLMs.

3 Prompting with efficient computation


As the scale of LLMs continues to expand, the concept of “Prompting with Efficient Computation”
has emerged as a response to alleviate the economic burden that lengthy prompts impose on both open-
source and closed-source LLMs. It has been observed that compressed prompts can be effectively
reconstructed by LLMs and reduce the length of the generated text (Jiang et al., 2023a). In this section,
we provide insight into research related to prompt compression, categorized into text-to-vector level
and text-to-text level approaches. The primary objective of prompt compression is to extract essential
information from the original prompts so that LLMs can maintain a comparable performance level
with the original prompts.

3.1 Knowledge distillation

Knowledge Distillation (KD) (Hinton et al., 2015) is a classic compression method, the core idea
is to direct a lightweight student model to “mimic” a better-performing and more complex teacher

4
Under review as a conference paper at COLM 2024

model. The KD methods in this paper are oriented towards the compression of the prompt itself
rather than the model structure, which aims to compress the natural language information of the
hard prompt inside LLMs through soft prompt tuning. The common practice is to use the outputs
of the teacher model T as labels for training the student model S to align with the distribution of T .
With the objective of compressing the hard prompts c, the loss function is typically Kullback-Leibler
Divergence between teacher and student: LS (pcS , c) = Ex [DKL (pT (y | c, x) ∥ pcS (y | x))].
In-context learning (ICL) is currently the main mode of adapting LLM to downstream tasks, Askell
et al. (2021) introduce Context Distillation technique to successfully compress a HHH (helpful,
honest, and harmless) prompt consisting of 4600 words inside LLMs, achieving a more beneficial
human-alignment effect compared to ICL. Choi et al. (2022) enable the student model to grasp
inaccessible prompt information through knowledge distillation. Snell et al. (2022) propose a method
of “learning by distilling context”, allowing LLMs to internalize ICL benefits from prompts. The
zero-shot student prompts contain only [task-input] and [final answers], omitting [instructions] and
[reasoning process] of few-shot teacher prompts, used for fine-tuning the same model to directly
predict the final answer with minimal instructions.
Although KD methods compress prompt information into language models at the macro level, but
there is no concrete carrier for the compressed information. It was not until Wingate et al. (2022)’s
work that compressed prompt took shape slightly. It experimentally verifies that conditioned LLMs
can effectively compress hard prompts into soft prompts by minimizing the KL-divergence between
both distributions, which retain a substantive amount of information about the original prompts.
Utilizing the Bayesian attribute classifier framework with contrastive contexts methods, it is further
discovered that even a single token in the compressed prompts can effectively control the reduction
of text toxicity.
More briefs on similar work (Mu et al., 2023; Kim et al., 2024) are detailed in the Appendix A.1.1.
As can be seen, knowledge distillation is an effective means of prompt compression, however, it
requires synthesizing the training data with the help of another teacher model, essentially mimicking
the behavior of the better-performing model, rather than learning the ability to compress prompts.

3.2 Encoding

A steady stream of experiments (Wang et al., 2023a; Hendel et al., 2023) have confirmed that the
LLM can act as a compressor to convert input texts to vectors, and we refer to this text-to-vector level
compression as encoding. Current encoding methods fine-tune LMs with Cross-Entropy objective,
effectively compressing the extensive information of hard prompts into a set of concise and model-
accessible vectors, thus alleviating the efficiency concerns associated with lengthy texts.
The semantic information from all modalities in the context is valuable for prompting LLMs. Early
work (Gal et al., 2022) encoded image information into a special token, e.g., “a photograph of S∗ ”,
denoting image features by S∗ , and then training models to generate desired images based on this
natural language prompt. Subsequently, research on this kind of information encoding flourished in
the NLP community.
Interestingly, eXtensible Prompt proposed by Ge et al. (2022) can encode indescribable styles in
natural language into imaginary words w e in extensible vocabulary. The w e is mixed with natural
language tokens during training, noting that other parameter weights except for w
e are frozen. In order
to avoid In-Domain (ID) overfitting and achieve Out-of-Distribution (OOD) robustness, Context-
Augmented Learning (CAL), which contains template and content augmentation, achieves OOD
X-Prompts with general usability to compensate for communication gaps between humans and LLMs.
Unlike the alignment (Wingate et al., 2022) between hard and soft prompts, Chevalier et al. (2023)
introduce the AutoCompressor building on the RMT (Bulatov et al., 2022) architecture. It recursively
compresses long documents into compact summary vectors by fine-tuning PLMs with an unsupervised
objective (i.e. next-token prediction), which are then accessible to the model as soft prompts.
The detailed compression process is to split the long document according to random length, and
sequentially compress segment content and k summary tokens into summary vectors, which are
prepended to the next segment inputs, and so on recursively compressed until finally accumulating
summary vectors of the whole document. Summary vectors can encode important information of

5
Under review as a conference paper at COLM 2024

long context especially suitable for retrieval settings and can also be pre-computed, cached, and
re-used like Mu et al. (2023) to reduce memory footprint and speed up inference.
More briefs on similar work (Ge et al., 2023; Qin & Durme, 2023) are detailed in the Appendix A.1.2.
We observe that the above works have designed fictitious tokens (indicated in italics in this section)
expanding the vocabulary before compression, and these tokens extrinsically appear to condense
the key information into a form understandable to LLMs; intrinsically can it be interpreted that they
extend the semantics according to the LLM properties? Compression associated with LLMs is a topic
worth pursuing in the future.

3.3 Filtering

The above two prompt compression methods require close involvement of LLMs, but the prevailing
LLMs closed parameter weights and there are not enough computational resources available for
text-to-vector level compression even with open-source LLMs. The current state of LLMs in
dealing with lengthy prompts is that their ICL capability will be impeded by input window size or
information redundancy, ultimately resulting in suboptimal responses. So comes the question: is all
the information in the prompt beneficial for LLMs? Obviously Not. In response, researchers gradually
turn their attention to text-to-text level compression, which involves evaluating the information entropy
of different lexical structures in the prompt by a lightweight LM, and then filtering out redundant
information to the greatest extent, ultimately simplifying the user prompts.
Selective Context (Li et al., 2023) leverages the concept of “self-information” introduced by Shannon
(Shannon, 1948) to quantify the amount of information within a prompt. Li et al. (2023) utilize
a causal language model to predict the output probability of each token xi and then compute self-
information, defined as the negative log-likelihood probability I(xi ) = − log2 P (xi |x0 , x1 , ..., xi−1 )
at sentence-level. Redundant information is then filtered out using a percentile-based filtering method,
where three distinct types of semantic units (token, phrase, and sentence) are defined as the filtered
information units. Notably, the phrase-level filtering is verified to be the most effective approach in
optimizing information retention.
In addition to the initial filtering of original prompts, some of the work also considered how well
the filtered prompts fit into LLMs. So Jiang et al. (2023a) devise a series of tricks to refine the
information integrity and LLM adaptability of the prompts. Different from the selective context,
LLMLingua refines the prompts into three components: instructions, questions, and demonstrations.
The budget controller will dynamically allocate different compression ratios for each component,
with redundant demonstrations compressed at a relatively high ratio. After demonstration-level
compression, iterative token-level prompt compression (ITPC) is performed based on the perplexity
distribution of all segments calculated by a small compressor PLM (Alpaca-7B or GPT2-Alpaca).
Finally, in order to achieve distribution alignment, the small compressor PLM is instruction fine-tuned
with the output from the LLM with compressed prompts.
More briefs on similar work (Jiang et al., 2023b; Pan et al., 2024; Liu et al., 2023a; Fei et al., 2023)
are detailed in the Appendix A.1.3.
The above works achieve text-to-text level prompt compression by measuring and filtering out
redundant information in natural language prompts as much as possible. Relatively shorter LLM
inputs inevitably bring the benefits of faster inference and lower memory overhead, so such model-
agnostic prompt compression work will be more popular in the era of closed-source LLMs.
Conclusion In order to achieve efficient computation of prompting, this section focuses on prompt
compression. It evolves from a parametric optimization at the text-to-vector level to parameter-free
optimization at the text-to-text level. Based on the amount of natural language information, the LLM
input is shortened as much as possible, which in turn achieves the following three aspects of high
efficiencies: 1) representing information as vectors to decrease perplexity and enhance task accuracy;
2) significantly reducing memory usage, ideal for situations that involve pre-computation, caching,
and reuse; and 3) expanding the context window capacity and accelerating inference of lengthy
prompts, particularly beneficial for scenarios with extensive context. In brief, prompt compression
emerges as a promising research avenue for efficiently prompting LLMs in the future.

6
Under review as a conference paper at COLM 2024

Original
Prompts
Input
Optimization
Gradient Evolution
Algorithms
Meta Instruct Optimizer
Candidate Evaluator
Prompts LM Augment prompt search space Prompts
Output
Feedback
Optimized
Prompts Iteratively Update

Figure 4: Process of automatic prompt optimization

4 Prompting with efficient design

The concept of “Prompting with Efficient Design” is introduced in response to the growing complexity
of prompt content. As time-consuming and laborious manual-design prompt methods faded out of the
history stage, and gradient-based prompt tuning methods were no longer applicable to closed-source
LLMs, automatic optimization based on Prompt Engineering (PE) gradually came into the limelight.
Specifically, the “discrete” prompt optimization presented in this paper involves finding the best
“natural language” prompt within a given search space to maximize task accuracy. Based on the
powerful generalist capabilities of LLMs, automatic prompt optimization has shown promising
progress, whose workflow is roughly illustrated in Figure 4. We will explore this problem in
depth from the perspectives of traditional mathematical optimization and intelligent algorithmic
optimization, thus dividing this section into gradient-based and evolution-based approaches.

4.1 Gradient-based methods

Gradient-descent is a core traditional optimization algorithm. In the process of updating parameters


in neural networks, many objective functions evolve from gradient-descent algorithms. It is well
known that the precondition for using the gradient-descent algorithm in continuous optimization
space (i.e., language models) is that the objective function is differentiable. However, the hard
prompt mentioned in this paper is discrete, which leads to a contradiction between the optimization
object and the optimization space. Therefore, researchers have investigated suitable gradient-based
optimization frameworks for open-source and closed-source models separately. For open-source
models, fine-tuning can be performed based on the real gradient, while for closed-source models, the
gradient can only be imitated for prompting.

4.1.1 Real-gradient tuning


Extensive research (Lester et al., 2021) in the initial stages has demonstrated significant advantages
in model performance afforded by the implementation of soft prompts, there are still some efforts on
how to optimize hard prompts with soft prompt tuning. Scholars have developed various strategies
that fundamentally translate discrete space into continuous space so that discrete prompt optimization
can be performed based on real-gradient tuning.
AutoPrompt (Shin et al., 2020) is one of the first discrete prompt optimization frameworks for
Transformer encoder-only models. It reformulates the task-specific objective as a language modeling
task similar to the fill-in-the-blank problem, with the purpose of eliciting knowledge from LMs.
AutoPrompt constructs a prompt template containing the original input, trigger tokens, and [MASK]
tokens. In the case of the sentiment analysis task, the logical classifier automatically detects a set
of labeled tokens corresponding to [MASK] as candidates and trigger tokens are selected from the
vocabulary based on a gradient search, iteratively replacing the initial [MASK] according to the
scores, and finally obtaining a complete natural language prompt.
Due to the lack of interpretability of soft prompts and intractable optimization over discrete tokens,
RLPrompt (Deng et al., 2022) proposes a discrete prompt optimization approach based on reinforce-
ment learning. For classification and generation tasks, different LMs are used to feed back rewards
respectively, which is a more stable training strategy and improves the training efficiency compared

7
Under review as a conference paper at COLM 2024

to AutoPrompt. RLPrompt computes the MLP gradients by back-propagating through the policy LM
and explores the prompt search space more systematically and efficiently than GrIPS (Prasad et al.,
2023). It also verifies that gibberish prompts are transferrable between PLMs, further illustrating that
PLMs grasp shared structures for prompting, as well as being distinct from human language patterns.
For the text-to-image generation task, in order to generate more aesthetically pleasing images,
Promptist (Hao et al., 2022) utilizes PLM like GPT-2 as the prompt optimization model to improve
the quality of textual prompts in the text-to-image task. The prompt adaptation framework includes
supervised fine-tuning with human-engineered prompts and then reinforcement learning based on
Relevance and Aesthetic as the reward signals.
More briefs on similar work (Wen et al., 2023; Chen et al., 2023b; Lin et al., 2023) are detailed in the
Appendix A.1.4.

4.1.2 Imitated-gradient prompting


As the performance of LLM becomes more and more excellent, researchers try to adopt LLMs as the
optimizer, but in the face of more and more inaccessible black-box LLMs, gradient-based prompting
optimization methods are no longer able to achieve discrete prompting optimization (Shin et al.,
2020). As far as the current research trend is concerned, researchers either extend the search space of
hard prompts or represent the essence of gradients in natural language.
GrIPS (Prasad et al., 2023) represents a pioneering approach in the realm of automatic prompt
optimization, introducing Gradient-free, Edit-based Instructional Prompt Search as the first attempt
at optimizing natural language prompts utilizing API-based (i.e., closed-source) models. The prompt
modes contain different combinations of instruction, in-context examples, and test instances. GrIPS
expands the prompt optimization space based on four editing operations (delete, swap, paraphrase,
addition) at the phrase level to identify the most effective prompt through iterative beam search.
Inspired by GrIPS, Zhou et al. (2022b) propose Automatic Prompt Engineer (APE) for automatic
instruction generation and selection. Here we emphasize the important role played by LLMs in
three aspects, 1) Inference: generating a large number of candidate prompts for a specific task’s
input-output pairs; 2) Scoring: selecting better candidate prompts according to evaluation metrics; 3)
Resampling: Monte Carlo searching iteratively to further augment high-quality prompts based on
semantic similarity. The LLMs are guided through a manually designed instruction template (known
as meta-prompt in the following) to realize the corresponding functions that optimization requires.
More briefs on similar work (Pryzant et al., 2023) are detailed in the Appendix A.1.4.

4.2 Evolution-based methods

It is because black-box models do not have a clear optimization direction, and traditional mathematical
optimization algorithms are difficult to solve such NP-hard problems. The evolutionary algorithm,
which simulates the biological evolution process of “survival of the fittest” in nature, is a kind of
random search by sampling objective functions. In fact, many researches are based on the idea of
evolution, which is essentially to exploit the diversity of samples in the search space, and then explore
the optimization direction through iterations.
OPRO (Yang et al., 2023) places additional confidence in the capabilities of LLM as an optimizer.
Diverging from approaches such as APE and APO, which involve direct instructions to the LLM on
executing specific functions, OPRO integrates the optimization trajectory within the meta-prompt.
This strategy enables the LLM to understand the commonalities among high-scoring solutions,
thereby independently identifying the optimal direction for optimization. Experimental evidence
indicates that OPRO achieves a more consistent optimization process in comparison to EvoPrompt
which relies on high-quality and task-specific initial prompts.
Whether using LLM editing, evolution, or rephrasing methods is currently only attempted on shorter
prompts, applying the above works to long prompts is still challenging. As a result, Hsieh et al. (2023)
pay more attention to the discrete optimization of long prompts. It decomposes long prompts into
multiple individual sentences and employs the LLM-Mutator to generate semantic-similar prompts
with the “beam search” version greedy algorithm, when coupled with the inclusion of search history,
results in a notable enhancement in task performance.

8
Under review as a conference paper at COLM 2024

More briefs on similar work (Chen et al., 2023a; Guo et al., 2023; Fernando et al., 2023) are detailed
in the Appendix A.1.5.
Conclusion In order to avoid extensive human resource consumption of manual-designed prompts,
this section discusses a variety of methods and frameworks for automatic prompt optimization. The
prevailing research draws on the idea of optimization algorithms from traditional machine learning.
The primary objective is quickly searching for the optimal prompt in the natural language space,
thereby facilitating state-of-the-art LLMs. Initial explorations into gradient-based optimization have
seen the incorporation of lightweight neural networks as supplementary optimization instruments.
Subsequent methods have evolved to either transform the discrete prompt space into a continuous
one for optimization purposes or to broaden the discrete search domain to enable direct optimization.

5 Future prompting: a theoretical perspective

At a higher level, we would like to abstract the efficient prompting paradigm into a multi-objective
optimization problem, with the overall objective of compressing prompts to reduce the computational
complexity (Objective 1) while optimizing the LLM task accuracy (Objective 2). We respectively
define the inputs as X, the outputs as Y , and the accessible parameters as Θ. Where X is discrete,
collectively referred to as hard prompts in this paper, and can be decomposed into different com-
ponents, such as instructions xins , questions xque , demonstrations xdes , etc. Θ is continuous and
contains not only the accessible parameters of the model itself θM , but also the vector representations
of the hard prompt θp . The overall optimization formulation can be defined as Ftotal :

Ftotal = λ1 · Fcompression (X)


e + λ2 · Faccuracy (Θ) (1)
 
Fcompression (X)
e = min D Y (X
e | argmaxI(X)),
e Y (X) (2)

where Xe denotes the prompts that is compressed as much as possible while keeping the performance
comparable to the original ones, D(·) denotes the discrepancy between the outputs before and after
prompt compression Fcompression , and I(x) denotes the information entropy metrics that identify the
amount of information. Depending on the appropriate compression rate, the most useful semantic
information in each prompt component is retained according to the semantic units. In order to
achieve the faithfulness of the original prompts, the information filtering can be considered as a
binary classification task of the valid and invalid information as well (Sec.3.3).
The optimization process of Θ is to fine-tune the accessible parameters in the continuous space
to achieve desired task accuracy Faccuracy . Since the optimization object (hard prompt) is discrete,
one approach is to convert it to the soft prompt form through knowledge distillation (Sec.3.1) or
encoding (Sec.3.2). Another option is to convert the optimization space into a discrete one. Inspired
by optimization algorithms, the powerful LLM can respectively act as a generator, optimizer, and
evaluator to expand the prompt space, and then search for the best prompt based on the feedback,
similar to the reinforcement learning framework mentioned in Section 4.
Prospects for future work The huge potential of LLM remains to be explored, while efficient
prompting methods as an important medium for human-computer interaction is a key means to
achieve alignment. Looking ahead, the inaccessibility of LLM has become an irreversible situation.
Therefore, it can be foreseen that the trend of future prompting will also be centered around the hard
prompt. Based on Eq.1, we discuss the promising future research directions in this field.

• Filtering X: Measuring the amount of information in natural language that is beneficial to


LLMs and then filtering out redundant information as much as possible.
• Fine-tuning Θ: When constructing compressors, LLMs should focus on a reasonable
conversion of different semantic space while smaller LMs should focus on the construction
of the compressed training data.
• Co-optimization: simultaneous optimization of hard&soft prompt has not yet appeared, the
combination of two different semantic spaces deserves further exploration, which is, to some
extent, a considerable development direction for human-model alignment.

9
Under review as a conference paper at COLM 2024

6 Conclusion

In this work, we summarize efficient prompting methods for LLMs with the purpose of improving
the LLM efficiency and performance. We review existing related work with high recognition, shed
light on the inherent connections within each category, and deeply abstract these approaches from a
theoretical perspective. Finally, we provide LLM practitioners with a list of open-source projects A.2
for quick reference in both scientific research and commercial deployment and a typology diagram
A.3 to overview the efficient prompting field.

Acknowledgements

This work was supported in part by the National Science Foundation of China (No.62276056), the
Natural Science Foundation of Liaoning Province of China (2022-KF-16-01), the Fundamental Re-
search Funds for the Central Universities (Nos. N2216016 and N2316002), the Yunnan Fundamental
Research Projects (No. 202401BC070021), and the Program of Introducing Talents of Discipline to
Universities, Plan 111 (No.B16009).

References
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones,
Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny
Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack
Clark, Sam McCandlish, Christopher Olah, and Jared Kaplan. A general language assistant as a
laboratory for alignment. ArXiv preprint, abs/2112.00861, 2021. URL https://arxiv.org/
abs/2112.00861.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar-
wal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCan-
dlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot
learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan,
and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John A. Gehrke, Eric Horvitz, Ece Ka-
mar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi,
Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments
with gpt-4. ArXiv preprint, abs/2303.12712, 2023. URL https://arxiv.org/abs/2303.
12712.

Aydar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. Recurrent memory transformer. ArXiv preprint,
abs/2207.06881, 2022. URL https://arxiv.org/abs/2207.06881.

Angelica Chen, David Dohan, and David R. So. Evoprompting: Language models for code-level
neural architecture search. ArXiv preprint, abs/2302.14838, 2023a. URL https://arxiv.
org/abs/2302.14838.

Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. Instructzero: Efficient
instruction optimization for black-box large language models. ArXiv preprint, abs/2306.03082,
2023b. URL https://arxiv.org/abs/2306.03082.

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to
compress contexts. ArXiv preprint, abs/2305.14788, 2023. URL https://arxiv.org/abs/
2305.14788.

10
Under review as a conference paper at COLM 2024

Eunbi Choi, Yongrae Jo, Joel Jang, and Minjoon Seo. Prompt injection: Parameterization of fixed
inputs. ArXiv preprint, abs/2206.11349, 2022. URL https://arxiv.org/abs/2206.
11349.

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song,
Eric Xing, and Zhiting Hu. RLPrompt: Optimizing discrete text prompts with reinforcement
learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Pro-
cessing, pp. 3369–3391, Abu Dhabi, United Arab Emirates, 2022. Association for Computational
Linguistics. URL https://aclanthology.org/2022.emnlp-main.222.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, 2019.
Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://
aclanthology.org/N19-1423.

Ning Ding, Yujia Qin, Guang Yang, Fu Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen,
Chi-Min Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Haitao Zheng,
Jianfei Chen, Yang Liu, Jie Tang, Juan Li, and Maosong Sun. Delta tuning: A comprehensive study
of parameter efficient methods for pre-trained language models. ArXiv preprint, abs/2203.06904,
2022. URL https://arxiv.org/abs/2203.06904.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and
Zhifang Sui. A survey on in-context learning. 2022. URL https://api.semanticscholar.
org/CorpusID:255372865.

Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han. Extending context
window of large language models via semantic compression. ArXiv preprint, abs/2312.09571,
2023. URL https://arxiv.org/abs/2312.09571.

Chrisantha Fernando, Dylan S. Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel.
Promptbreeder: Self-referential self-improvement via prompt evolution. ArXiv preprint,
abs/2309.16797, 2023. URL https://arxiv.org/abs/2309.16797.

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel
Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual
inversion. ArXiv preprint, abs/2208.01618, 2022. URL https://arxiv.org/abs/2208.
01618.

Tao Ge, Jing Hu, Li Dong, Shaoguang Mao, Yanqiu Xia, Xun Wang, Siyi Chen, Furu Wei, and
Si-Qing Chen. Extensible prompts for language models on zero-shot language style customization.
2022. URL https://api.semanticscholar.org/CorpusID:254125409.

Tao Ge, Jing Hu, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context
compression in a large language model. ArXiv preprint, abs/2307.06945, 2023. URL https:
//arxiv.org/abs/2307.06945.

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian,
Yujiu Yang, Tsinghua University, and Microsoft Research. Connecting large language models with
evolutionary algorithms yields powerful prompt optimizers. ArXiv preprint, abs/2309.08532, 2023.
URL https://arxiv.org/abs/2309.08532.

Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation.
ArXiv preprint, abs/2212.09611, 2022. URL https://arxiv.org/abs/2212.09611.

Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. ArXiv
preprint, abs/2310.15916, 2023. URL https://arxiv.org/abs/2310.15916.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.
ArXiv preprint, abs/1503.02531, 2015. URL https://arxiv.org/abs/1503.02531.

11
Under review as a conference paper at COLM 2024

John H. Holland. Adaptation in natural and artificial systems: An introductory analysis with
applications to biology, control, and artificial intelligence. 1992. URL https://api.
semanticscholar.org/CorpusID:58781161.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea
Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP.
In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International
Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA,
volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 2019. URL
http://proceedings.mlr.press/v97/houlsby19a.html.
Cho-Jui Hsieh, Si Si, Felix X. Yu, and Inderjit S. Dhillon. Automatic engineering of long prompts.
ArXiv preprint, abs/2311.10117, 2023. URL https://arxiv.org/abs/2311.10117.
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing
prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, pp. 13358–13376, 2023a.
Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu.
Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression.
ArXiv preprint, abs/2310.06839, 2023b. URL https://arxiv.org/abs/2310.06839.
Gyeongman Kim, Doohyuk Jang, and Eunho Yang. Promptkd: Distilling student-friendly knowledge
for generative language models via prompt tuning, 2024.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient
prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, pp. 3045–3059, Online and Punta Cana, Dominican Republic, 2021.
Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL
https://aclanthology.org/2021.emnlp-main.243.
Bei Li, Quan Du, Tao Zhou, Shuhan Zhou, Xin Zeng, Tong Xiao, and Jingbo Zhu. Ode transformer:
An ordinary differential equation-inspired model for neural machine translation. ArXiv preprint,
abs/2104.02308, 2021. URL https://arxiv.org/abs/2104.02308.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
pp. 4582–4597, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
acl-long.353. URL https://aclanthology.org/2021.acl-long.353.
Yucheng Li, Bo Dong, Chenghua Lin, and Frank Guerin. Compressing context to enhance inference
efficiency of large language models. In Conference on Empirical Methods in Natural Language Pro-
cessing, 2023. URL https://api.semanticscholar.org/CorpusID:263830231.
Xiaoqiang Lin, Zhaoxuan Wu, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick
Jaillet, and Bryan Kian Hsiang Low. Use your instinct: Instruction optimization using neural
bandits coupled with transformers. ArXiv preprint, abs/2310.02905, 2023. URL https://
arxiv.org/abs/2310.02905.
Junyi Liu, Liangzhi Li, Tong Xiang, Bowen Wang, and Yiming Qian. Tcra-llm: Token compression
retrieval augmented large language model for inference cost reduction. In Conference on Empirical
Methods in Natural Language Processing, 2023a. URL https://api.semanticscholar.
org/CorpusID:264439519.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig.
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language
processing. ACM Computing Surveys, 55(9):1–35, 2023b.
Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning:
Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.
61–68, Dublin, Ireland, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.
acl-short.8. URL https://aclanthology.org/2022.acl-short.8.

12
Under review as a conference paper at COLM 2024

Jesse Mu, Xiang Lisa Li, and Noah D. Goodman. Learning to compress prompts with gist tokens.
ArXiv preprint, abs/2304.08467, 2023. URL https://arxiv.org/abs/2304.08467.
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin,
Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang, Karl Cobbe,
Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry
Tworek, Jacob Hilton, and Reiichiro Nakano. Llmlingua-2: Data distillation for efficient and
faithful task-agnostic prompt compression. 2024. URL https://api.semanticscholar.
org/CorpusID:268531237.
Jason Phang, Yi Mao, Pengcheng He, and Weizhu Chen. Hypertuning: Toward adapting large
language models without back-propagation. In International Conference on Machine Learning,
2022. URL https://api.semanticscholar.org/CorpusID:253761398.
Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. GrIPS: Gradient-free, edit-based
instruction search for prompting large language models. In Proceedings of the 17th Conference of
the European Chapter of the Association for Computational Linguistics, pp. 3845–3864, Dubrovnik,
Croatia, 2023. Association for Computational Linguistics. URL https://aclanthology.
org/2023.eacl-main.277.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring
and narrowing the compositionality gap in language models. ArXiv preprint, abs/2210.03350,
2022. URL https://arxiv.org/abs/2210.03350.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt
optimization with ”gradient descent” and beam search. In Conference on Empirical Methods
in Natural Language Processing, 2023. URL https://api.semanticscholar.org/
CorpusID:258546785.
Guanghui Qin and Benjamin Van Durme. Nugget: Neural agglomerative embeddings of text. ArXiv
preprint, abs/2310.01732, 2023. URL https://arxiv.org/abs/2310.01732.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. 2018.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse
attention with routing transformers. Transactions of the Association for Computational Linguistics,
9:53–68, 2021. doi: 10.1162/tacl a 00353. URL https://aclanthology.org/2021.
tacl-1.4.
Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:623–656, 1948.
URL https://api.semanticscholar.org/CorpusID:55379485.
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Elicit-
ing knowledge from language models using automatically generated prompts. ArXiv preprint,
abs/2010.15980, 2020. URL https://arxiv.org/abs/2010.15980.
Charles Burton Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. ArXiv preprint,
abs/2209.15189, 2022. URL https://arxiv.org/abs/2209.15189.
Rainer Storn and Kenneth V. Price. Differential evolution – a simple and efficient heuristic for global
optimization over continuous spaces. Journal of Global Optimization, 11:341–359, 1997. URL
https://api.semanticscholar.org/CorpusID:5297867.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer
with rotary position embedding. ArXiv preprint, abs/2104.09864, 2021. URL https://arxiv.
org/abs/2104.09864.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand
Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language
models. ArXiv preprint, abs/2302.13971, 2023. URL https://arxiv.org/abs/2302.
13971.

13
Under review as a conference paper at COLM 2024

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von
Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman
Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.
5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/
3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan,
Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, and Mi Zhang. Efficient large language models:
A survey. ArXiv preprint, abs/2312.03863, 2023. URL https://arxiv.org/abs/2312.
03863.
Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label
words are anchors: An information flow perspective for understanding in-context learning. ArXiv
preprint, abs/2305.14160, 2023a. URL https://arxiv.org/abs/2305.14160.
Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim.
Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language
models. In Annual Meeting of the Association for Computational Linguistics, 2023b. URL
https://api.semanticscholar.org/CorpusID:258558102.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Huai hsin Chi, and Denny Zhou.
Self-consistency improves chain of thought reasoning in language models. ArXiv preprint,
abs/2203.11171, 2022. URL https://arxiv.org/abs/2203.11171.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, Ed Huai hsin Chi, Tatsunori Hashimoto, Oriol
Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.
ArXiv preprint, abs/2206.07682, 2022a. URL https://arxiv.org/abs/2206.07682.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, F. Xia, Quoc Le,
and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ArXiv
preprint, abs/2201.11903, 2022b. URL https://arxiv.org/abs/2201.11903.
Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein.
Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery.
ArXiv preprint, abs/2302.03668, 2023. URL https://arxiv.org/abs/2302.03668.
David Wingate, Mohammad Shoeybi, and Taylor Sorensen. Prompt compression and contrastive
conditioning for controllability and toxicity reduction in language models. In Findings of the
Association for Computational Linguistics: EMNLP 2022, pp. 5621–5634, Abu Dhabi, United Arab
Emirates, 2022. Association for Computational Linguistics. URL https://aclanthology.
org/2022.findings-emnlp.412.
Tong Xiao and Jingbo Zhu. Introduction to transformers: an nlp perspective. ArXiv preprint,
abs/2311.17633, 2023. URL https://arxiv.org/abs/2311.17633.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun
Chen. Large language models as optimizers. ArXiv preprint, abs/2309.03409, 2023. URL
https://arxiv.org/abs/2309.03409.
Zhichao Yin, Binyuan Hui, Min Yang, Fei Huang, and Yongbin Li. Dialclip: Empowering clip as
multi-modal dialog retriever. ArXiv preprint, abs/2401.01076, 2024. URL https://arxiv.
org/abs/2401.01076.
Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans,
Olivier Bousquet, Quoc Le, and Ed Huai hsin Chi. Least-to-most prompting enables complex
reasoning in large language models. ArXiv preprint, abs/2205.10625, 2022a. URL https:
//arxiv.org/abs/2205.10625.
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan,
and Jimmy Ba. Large language models are human-level prompt engineers. ArXiv preprint,
abs/2211.01910, 2022b. URL https://arxiv.org/abs/2211.01910.

14
Under review as a conference paper at COLM 2024

A Appendix

A.1 A collection of more briefs

A.1.1 Knowledge distillation

Gisting (Mu et al., 2023) compresses multi-task prompts into gist tokens, which are concrete vector
representations that can be attributed to the encoding in Section 3.2, but its implementation is still
based on the idea of KD in Snell et al. (2022). Gisting fine-tunes the distilled model (i.e. student)
with attention masking mechanism using the instruction fine-tuned LLMs (i.e. teacher) synthesized
multi-task data. Similar to the way HyperTuning (Phang et al., 2022) generates PEFT parameters, the
distilled model predicts gist tokens so that gisting incurs no additional training costs beyond standard
instruction fine-tuning. Improving on previous works, gisting implements multitask capabilities and
can be generalized to unseen task instructions, achieving a 26x compression rate while maintaining
the original performance.
PromptKD (Kim et al., 2024) leverages prompt tuning in KD for the first time to enable generative
LLMs to transfer student-friendly knowledge. First, the student model generates the response as
pseudo-target according to the request. Then, extra added soft prompts of the teacher distill student-
friendly knowledge with guidance from the student. Finally, the discrepancy between teacher and
student is minimized by reverse KL divergence.

A.1.2 Encoding

ICAE (Ge et al., 2023) compresses long context input into memory slots with a 4x compression rate.
In the pre-training phase, the autoencoding task is responsible for memory slot compression; the
text continuation task extends its generalization. The instruction fine-tuning phase is beneficial for
memory slots to accurately and appropriately respond to various instructions. Compared to Chevalier
et al. (2023), the training process is simpler as it avoids recursive compression.
The encoding framework NUGGET (Qin & Durme, 2023) employs hard-attention to map linguistic
input into a fractional number of nuggets. It utilizes a feedforward network to measure the context
information of every token embedding and dynamically selects top-k as nuggets. Since the selection
process is nondifferentiable, a residual connection is built to propagate the gradients back to the
NUGGET. Results indicate that punctuation has the highest probability of being selected as nuggets
to encode the preceding text information, and a 10% compression ratio is sufficient for perfect
reconstruction of the original input.
Some recent work (Yin et al., 2024) has also shown that fusing and then compressing useful multi-
modal information into vector representations that are easily understood by the model, is helpful for
prompting LLMs to perform multi-modal tasks with excellent results.

A.1.3 Filtering

On the basis of the LLMLingua framework, LongLLMLingua (Jiang et al., 2023b) supports modifica-
tion of the prompt component, e.g. the demonstration can be further extended to a long document. To
refine the previous work, it enhances LLM’s perception of key information related to the question
component in the prompt via contrastive perplexity, i.e., the distribution shift caused by the condition
of the question. There are also a lot of effective tricks to improve the performance of the model,
such as document reordering mechanism to precedence important information; dynamic compression
ratios to adaptively control the granular of coarse & fine-grained compression, and post-compression
subsequence recovery strategy to improve the integrity of the key information in LLM’s output. It
finally realizes the performance improvement relative to the original prompts and accelerates the
end-to-end latency in long context scenarios.
Pan et al. (2024) compresses prompts with a classification objective of whether to preserve or
discard tokens from the original prompt, with the advantages of 1) fully extracting features from
bi-directional contexts using Transformer encoder-based LMs; 2) lower latency since the binary
classification objective is the same as the training objective of the LM rather than the information
entropy; and 3) guaranteed faithfulness of the compressed prompt. LLMLingua-2 serves as a smaller-

15
Under review as a conference paper at COLM 2024

scale LM compressor, achieves more efficient compression, greater generalization, and further
reduced end-to-end latency compared to previous work.
Liu et al. (2023a) respectively analyze the gain in inference speed of summarization compression
(based on summarization model mT5) and semantic compression (based on semantic deviation
calculated by Euclidean distance) when LLM processing retrieval augmentation tasks, and explore
the relationship between information entropy and compressibility in detail. While Fei et al. (2023)
directly use the summary model for semantic compression to extend the information capacity of the
context window.

A.1.4 Gradient-based methods


Real-gradient tuning Wen et al. (2023) propose a gradient-based discrete prompt optimization
algorithm called PEZ, to discretize the continuous space by a projection function. PEZ first projects
the learnable embeddings into discrete vectors then computes their gradients for updating the contin-
uous iterations (i.e., regarding soft prompts as intermediate variables), and finally maps them into
hard prompts. PEZ automatically generates robust and flexible hard prompts that can be applied to
text-to-image and text-to-text applications, respectively.
Chen et al. (2023b) generates natural language prompts for black-box models by soft prompt tuning
open-source LLMs based on Bayesian Optimization (BO), outperforming APE’s zero-shot task
accuracy on 32 tasks. On this basis, Lin et al. (2023) adopt a neural bandit algorithm that replaces
the Gaussian process (GP) in BO with a neural network (NN) surrogate to optimize instructions for
black-box models.
Imitated-gradient prompting A similar work APO (ProTeGi) (Pryzant et al., 2023) introduces
natural language gradients as an alternative to numerical gradients. The meta-prompts instruct the
LLM to perform the following functions: 1) generating the flaw of the target prompt in natural
language (i.e., gradient); 2) editing the prompt in the opposite semantic direction of the gradient to
fix the flaw and 3) paraphrasing candidate prompts under conditions of keeping semantic meaning.
Essentially, discrete prompt optimization with nonparametric “Gradient-Descent” is implemented by
substituting differentiation with LLM feedback and backpropagation with LLM editing.

A.1.5 Evolution-based methods


Chen et al. (2023a) combines hard prompt engineering and soft prompt tuning, incorporating ideas
from evolutionary meta-learning algorithms into the overall prompt optimization process. It is well
known that evolutionary algorithms require a well-designed discrete search space, and the goal of
this work is to enhance the flexibility and diversity of the search space through the LLM acting
as a crossover operator. The EvoPrompting method consists of 5 phases that iteratively generate
high-quality solutions: initialing with a few well-designed seed prompts, crossing over and mutating
via 2-shot in-context learning, filtering, and scoring generated samples, fitness-based selecting top
candidates and soft prompt tuning with unselected samples.
In fact, there are also some nice works using search algorithms for prompt design, unlike Evo-
Prompting which mainly applies algorithmic ideas to the process design of prompting LLMs, but
instead optimizes the prompts themselves. For example, EvoPrompt (Guo et al., 2023) framework
mainly explores highly regarded evolutionary algorithms (i.e., Genetic Algorithm (Holland, 1992)
and Differential Evolution (Storn & Price, 1997)). LLMs are regarded as evolutionary operators, and
carefully designed instructions guide them to complete the evolutionary process.
Promptbreeder (Fernando et al., 2023) further improves the diversity of prompts and task accuracy
compared with a number of previous and concurrent related works, such as APE, OPRO, and
EvoPrompt. The major innovation from other works is that Promptbreeder evolves the task-prompt
while also optimizing the mutation-prompt online, i.e., the previously mentioned meta-prompts.
Promptbreeder designed five mutation operators based on LLMs, which generate and evolve prompts
in a self-referential way, enabling the prompts to self-improvement.

16
Under review as a conference paper at COLM 2024

A.2 A list of open-source projects

We collated links to open-source projects involved in this paper (refer to Table 1 below) for quick
reference in both scientific research and commercial deployment.

Method Link
Prompt Injection https://github.com/unbiarirang/
Fixed-Input-Parameterization
Gisting https://github.com/jayelm/gisting
Textual-Inversion https://textual-inversion.github.io/
AutoCompressor https://github.com/princeton-nlp/AutoCompressors
ICAE https://github.com/getao/icae
NUGGET https://github.com/hiaoxui/nugget
Selective Context https://github.com/liyucheng09/Selective_Context
LLMLingua https://huggingface.co/spaces/microsoft/LLMLingua
LLMLingua-2 https://llmlingua.com/llmlingua2.html
AutoPrompt https://ucinlp.github.io/autoprompt/
RLPrompt https://github.com/mingkaid/rl-prompt
GrIPS https://github.com/archiki/GrIPS
APE https://github.com/keirp/automatic_prompt_engineer
PEZ https://github.com/YuxinWenRick/
hard-prompts-made-easy
InstructZero https://github.com/Lichang-Chen/InstructZero
INSTINCT https://github.com/xqlin98/INSTINCT
APO https://github.com/microsoft/LMOps/tree/main/
prompt_optimization
OPRO https://github.com/google-deepmind/opro
EvoPrompt https://github.com/beeevita/EvoPrompt

Table 1: Open resources of efficient methods

17
Under review as a conference paper at COLM 2024

A.3 A typology diagram of efficient prompting methods

In order to provide a more intuitive view of the typology of existing efficient prompting methods, we
listed only a few early efforts in each category (refer to Figure 5 below).

Context Distillation (Askell et al., 2021)

Distilling Context (Snell et al., 2022)


Knowledge
Distillation Gisting (Mu et al., 2023)

PromptKD (Kim et al., 2024)

Textual-Inversion (Gal et al., 2022)

X-Prompt (Ge et al., 2022)


Efficient
Encoding AutoCompressor (Chevalier et al., 2023)
Computation
ICAE (Ge et al., 2023)

NUGGET (Qin & Durme, 2023)

Selective Context (Li et al., 2023)

LLMLingua (Jiang et al., 2023a)


Filtering
LongLLMLingua (Jiang et al., 2023b)

Efficient LLMLingua-2 (Pan et al., 2024)


Prompting
Methods
AutoPrompt (Shin et al., 2020)

RLPrompt (Deng et al., 2022)

Promptist (Hao et al., 2022)


Gradient-based
GrIPS (Prasad et al., 2023)

APE (Zhou et al., 2022b)

APO (ProTeGi) (Pryzant et al., 2023)


Efficient
Design
OPRO (Yang et al., 2023)

AEO (Hsieh et al., 2023)

Evolution-based EvoPrompting (Chen et al., 2023a)

EvoPrompt (Guo et al., 2023)

Promptbreeder (Fernando et al., 2023)

Figure 5: Typology of efficient prompting methods

18

You might also like