Prompt Engineering for Healthcare: Methodologies and Applications

Jiaqi Wang∗, Enze Shi∗, Sigang Yu, Zihao Wu, Chong Ma, Haixing Dai, Qiushi Yang, Yanqing Kang, Jinru Wu,
Huawen Hu, Chenxi Yue, Haiyang Zhang, Yiheng Liu, Yi Pan, Zhengliang Liu, Lichao Sun,
Xiang Li, Bao Ge, Xi Jiang, Dajiang Zhu, Yixuan Yuan, Dinggang Shen, Tianming Liu, and Shu Zhang

∗ Co-first author. Corresponding author: Shu Zhang.

Jiaqi Wang, Enze Shi, Sigang Yu, Yanqing Kang, Jinru Wu, Huawen Hu, Chenxi Yue, Haiyang Zhang, and Shu Zhang are with the School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China. Chong Ma is with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, China. (e-mail: {jiaqi.wang, ezshi, sgyu, yanqing.kang, jinru.wu, huawenhu, chenxi.yue, haiyang.zhang}@mail.nwpu.edu.cn; shu.zhang@nwpu.edu.cn; mc-npu@mail.nwpu.edu.cn).

Zihao Wu, Haixing Dai, Zhengliang Liu, and Tianming Liu are with the School of Computing, The University of Georgia, Athens 30602, USA. (e-mail: {zihao.wu1, haixing.dai, zl18864, tliu}@uga.edu).

Lichao Sun is with the Department of Computer Science and Engineering, Lehigh University, PA 18015, USA. (e-mail: lis221@lehigh.edu).

Xiang Li is with the Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston 02115, USA. (e-mail: XLI60@mgh.harvard.edu).

Dajiang Zhu is with the Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA. (e-mail: dajiang.zhu@uta.edu).

Dinggang Shen is with the School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China; Shanghai United Imaging Intelligence Co., Ltd., Shanghai 200230, China; and Shanghai Clinical Research and Trial Center, Shanghai 201210, China. (e-mail: Dinggang.Shen@gmail.com).

Qiushi Yang is with the Department of Electronic Engineering, City University of Hong Kong, Hong Kong 999077, China. Yixuan Yuan is with the Department of Electronic Engineering, Chinese University of Hong Kong, Hong Kong 999077, China. (e-mail: qsyang2-c@my.cityu.edu.hk; yxyuan@ee.cuhk.edu.hk).

Yiheng Liu and Bao Ge are with the School of Physics and Information Technology, Shaanxi Normal University, Xi'an 710119, China. (e-mail: {liuyiheng, bob_ge}@snnu.edu.cn).

Yi Pan is with the Glasgow College, University of Electronic Science and Technology of China, Chengdu 611731, China. Xi Jiang is with the School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China. (e-mail: dwaynepan5277@gmail.com; xijiang@uestc.edu.cn).

Abstract—Prompt engineering is a critical technique in the field of natural language processing (NLP) that involves designing and optimizing the prompts used to input information into models, aiming to enhance their performance on specific tasks. With the recent advancements in large language models (LLMs), prompt engineering has shown significant superiority across various domains and has become increasingly important in the healthcare domain. However, there is a lack of comprehensive reviews specifically focusing on prompt engineering in the medical field. This review introduces the latest advances in prompt engineering in the field of natural language processing for the medical field. First, we describe the development of prompt engineering and emphasize its significant contributions to healthcare natural language processing applications such as question-answering systems, text summarization, and machine translation. With the continuous improvement of general large language models, the importance of prompt engineering in the healthcare domain is becoming increasingly prominent. The aim of this article is to provide useful resources and bridges for healthcare natural language processing researchers to better explore the application of prompt engineering in this field. We hope that this review can provide new ideas and inspiration for research and application in medical natural language processing.

Index Terms—Prompt engineering, Healthcare, Natural language processing, Medical application

I. INTRODUCTION

Prompt engineering has emerged as a cutting-edge approach in the field of natural language processing (NLP), providing a more efficient and cost-effective means of using large language models (LLMs). This innovative paradigm is rooted in the development of LLMs, which have transformed our understanding of natural language [1], [2].

The history of language models can be traced back to the 1950s, but it was not until the introduction of the BERT and GPT models in 2018 that LLMs became mainstream [3]. These models utilize complex algorithms and massive amounts of training data to understand natural language in ways that were previously unimaginable. Their ability to learn patterns and structures in language has proven invaluable in a wide range of language-related tasks [4], [5]. However, fine-tuning these pre-trained models can be a costly and time-consuming process, requiring significant amounts of annotated data and computing resources. To address this challenge, researchers have turned to prompts as a means of guiding model learning [2], [6], [7]. Prompt learning is a novel paradigm in NLP that enables language models to perform few-shot or even zero-shot learning, adapting to new scenarios with minimal labeled data [1], [2], [8]. This approach is rooted in language modeling, directly modeling the probability of text. The key to prompt engineering is designing prompts for downstream tasks, which guide the pre-trained model to perform the desired task [9].

Prompt learning can be broken down into five key steps [1]. First, researchers must choose an appropriate pre-training model. Next, they must design prompts for downstream tasks, which can be tailored to the specific requirements of each task; this step is called the prompt engineering process. The third step involves designing responses based on the task at hand, allowing the model to produce the desired output. The fourth step is to expand the paradigm to further improve results or adaptability. Finally, researchers must design training strategies that enable the model to learn efficiently and effectively. Prompt engineering is thus a critical step in the prompt-based approach, which involves designing effective prompts to guide the pre-trained language model in downstream tasks.
While prompt engineering plays a crucial role in the success of the overall approach, studies have shown that the design of prompts can significantly impact the performance of the model on downstream tasks. A well-designed prompt should provide clear guidance to the model and facilitate effective task completion. Moreover, optimizing prompt parameters is also an important aspect of prompt engineering, as it can lead to improved model performance. The subsequent steps in prompt-based methods, such as designing answers, expanding the paradigm, and adjusting training strategies, require task-specific prompt design and optimization.

Overall, prompt engineering is a promising new approach in NLP that has the potential to revolutionize the field. Its focus on using prompts to guide model learning provides a more cost-effective and efficient means of training large language models, making it an increasingly important area of research in the years to come [10].

A. Scope and Focus of the Review

In recent years, the healthcare industry has received increasing attention as it is closely related to each individual. However, there is still a significant gap between the shortage of medical resources and people's needs [11], [12]. NLP technology can handle massive amounts of data from various medical literature and medical records, thus helping doctors and patients better understand and manage diseases, which is of great significance in improving medical standards and providing better health security [13].

Fig. 1. A wordcloud is employed to illustrate the research hotspots, focal points, and directions in prompt engineering for NLP in the medical domain, which reflects the key vocabulary and highlights the significant themes of this literature review.

However, traditional machine learning and deep learning methods cannot solve NLP tasks in the medical field very well [3]. The complex and diverse professional terminology in the medical field, the wide range of professional knowledge, and ethical issues related to patient privacy have all become difficult problems in the development of the healthcare industry [14]. The emergence of LLMs and prompt-based methods provides a new solution for NLP tasks in the medical field. As depicted in Figure 1, the wordcloud illustrates the research hotspots and focal points in NLP for the medical domain, highlighting the significant themes of this literature review. Through the powerful contextual learning ability of LLMs, useful information can be effectively obtained from a large number of medical literature and cases. By designing specific prompts, domain-specific knowledge and task-specific information can be introduced into the pre-trained model, resulting in better performance and gaining more extensive attention [15].

In the research of prompt engineering, designing appropriate prompts is an important issue. Currently, researchers are exploring various types of prompts, including manual prompts [14]–[19] and automated prompts [6], [20]–[22], as well as various methods of prompt construction, to help solve tasks in the medical field.

B. Literature Search Process

The importance of NLP tasks in the medical field cannot be overstated. With the increasing volume of medical data, there is a growing need for effective NLP techniques to improve the quality and efficiency of medical services. In this regard, prompt engineering has emerged as a promising approach to guiding model generation by providing targeted prompt information.

To investigate the application of prompt engineering in the medical field, we conduct a comprehensive review of the literature from 2019 to 2023 by searching for the keywords "prompt" and "medical NLP" on arXiv. As shown in Figure 2, a total of 333 papers related to prompts are identified, and ChatGPT is used to screen the abstracts for their relevance to the medical field, resulting in the selection of 140 relevant papers for further review. The dynamic graph at https://braininspiredai.github.io/prompt_trend_in_medical shows the daily publication rate of related papers.

Fig. 2. The graphical representation is utilized to depict the number of research papers on prompt engineering for NLP in the medical domain, published from 2019 to April 6, 2023, revealing the trend and growth of this field over time. The graph showcases three different plots: daily submitted count, cumulative daily submitted count, and annual submitted count.

These papers mainly focus on the design and application of prompt engineering in NLP tasks in the medical field, with the goal of providing suitable prompt design methods for different medical tasks. Specifically, the studies explore how to choose and design prompt elements, how to use prompts to guide a model to generate text that meets medical requirements, and how to evaluate the impact of different prompt designs on model performance.

C. Outline of the Review

Fig. 3. The overview of this literature review provides a comprehensive and systematic summary of the current state of research on prompt engineering for NLP in the medical domain.

This review article provides an in-depth overview of prompt engineering, a rapidly growing field of research focusing on enhancing medical applications through the development of effective prompts. As illustrated in Figure 3, we present the overview of this review article, with each section organized as follows:

The introduction of the article discusses the background and significance of prompt engineering and outlines the scope and focus of the review. It describes the literature search process followed and provides an outline of the article.

Section II presents the basics of prompt engineering, including common LLMs and the elements and components of prompts.

Section III discusses the different types of prompts available in the literature, with a focus on manual prompts such as zero-shot and few-shot prompting.
The section uses medical literature to highlight specific details. Automated prompts, including discrete and continuous prompting, are also discussed, and the section compares manual and automated prompting techniques.

In Section IV, the various applications of prompts in medical fields are covered in detail. These applications include classification, generation, detection, augmentation, question answering, and inference tasks, and the section provides the latest medical examples to illustrate the practical use of prompts.

Section V outlines the challenges and future directions of prompt engineering in medical applications. The section highlights the current research directions in the field and provides insight into the opportunities for future research.

Finally, Section VI concludes the review by summarizing the key findings and contributions of the article. The section also discusses the limitations of the review and provides recommendations for future research in the field of prompt engineering in medical applications.

Overall, the review provides insights into the latest developments in prompt engineering for medical NLP tasks and highlights the importance of prompt engineering in improving the accuracy and effectiveness of medical services. It also underscores the need for continued research to explore the potential of prompt engineering in the medical field and to identify effective strategies for its implementation. Ultimately, prompt engineering has the potential to transform the way medical services are delivered and to enhance the quality of care for patients.

II. BASIC KNOWLEDGE

In prompt engineering, LLMs are of great importance, as they enable the design of appropriate prompts for specific tasks and can generate high-quality results even with limited training data [16]. This section will delve into the latest advances in large models and provide a framework for prompt design.

A. Large Language Models

In this section, we explore the prevalent LLMs in prompt engineering within the medical field, which are trained on large amounts of text data and are capable of generating high-quality, contextually relevant, and coherent text in various NLP tasks. Among them, prompt-based techniques provide researchers with rich natural language patterns and structures and have become an important tool in prompt engineering. Prompt engineering is achieved through the design of prompts, which enables the model to generate diversified text based on different contextual environments and can be optimized and customized for different tasks and application scenarios [23]. In the medical field, the application of these models has provided significant help and inspiration for medical research and clinical practice.

BERT (Bidirectional Encoder Representations from Transformers) [24] is a pre-trained model developed by the Google team in 2018 that learns bidirectional encoder representations using the transformer architecture. Its purpose is to pre-train bidirectional representations by jointly conditioning on context in all layers, overcoming the limitations of previous unidirectional language models through a masked language model. In addition, BERT introduces the next sentence prediction task, which can be used in conjunction with the masked language model to pre-train text pair representations [25]. The training objective of BERT is highly meaningful, as it demonstrates the importance of bidirectional pre-training for language representations. Compared to previous unidirectional language models, BERT's pre-training can better capture semantic information in context [24].

BERT is the first fine-tuning-based representation model, and it surpassed many task-specific architectures on 11 NLP tasks, setting new performance records [24]. This achievement demonstrates the enormous potential of pre-trained models in NLP tasks and provides important references for the development of subsequent NLP models.

The Text-to-Text Transfer Transformer (T5) [4] is an LLM developed by Google in 2019. The fundamental idea behind T5 is to treat every NLP task as a "text-to-text" problem. The training objective of T5 is to develop it into a universal knowledge representation model, enabling the model to better understand and process text. T5 has achieved state-of-the-art results in multiple NLP tasks, such as text classification, machine translation, text summarization, and question answering, among others. In these tasks, text is taken as input, and new text is generated as output to solve specific problems. The advantage of this approach is that by treating all NLP tasks as "text-to-text" problems, the T5 model can learn how to map input text to output text, thereby enhancing its understanding and processing of text.

ChatGPT is a series of generative pre-trained transformer models developed by the OpenAI team. GPT-1 [26], released in June 2018, was trained on a large-scale text corpus with the aim of generating coherent and natural text. Subsequently, GPT-2 [27] was released in February 2019; built on the foundation of GPT-1 with a larger model size and more training data, it achieved outstanding performance on a variety of NLP tasks. In June 2020, GPT-3 [28] was introduced, utilizing an even larger model size and more extensive training data than GPT-2, resulting in further improvements in model performance. GPT-3.5 then further optimized the efficiency and performance of GPT-3.

To enhance the performance of the model, the OpenAI team uses supervised and reinforcement learning to fine-tune GPT-3.5, with human intervention playing a key role in enhancing machine learning capabilities [29], [30]. The optimization of training methods and increased training data have led to GPT-4 being able to generate text similar to that produced by humans, representing a significant breakthrough in the field of artificial intelligence.

In the future, LLMs will continue to evolve and improve, and we can expect their applications in the medical field to bring more value and contributions to human health.

B. Elements and Formats of Prompts

Prompts are usually classified into two forms: cloze prompts [16], [31] and prefix prompts [6], [22].
Cloze prompts refer to a slot that needs to be filled in the middle of the template text. For example, in sentiment analysis, when X = "I love this movie," the template may take the form of "[X]. Overall, it was a [Z] movie." Then, the new sentence X' becomes "I love this movie. Overall, it was a [Z] movie." This is an example of a cloze prompt, as shown in Figure 4(a). Prefix prompts, on the other hand, refer to the input text being entirely situated before the generated answer text Z, as shown in Figure 4(b): "English: [X] Chinese: [Z]".

Fig. 4. This study distinguishes between two types of prompts, namely cloze prompts (illustrated in a) and prefix prompts (illustrated in b), where slot [X] represents the input text and slot [Z] represents the answer text.

Prompt engineering refers to the process of creating a prompt function that guides a model to perform effectively in downstream tasks. This process typically involves two steps. First, a template is applied, which is a text string that contains two slots: an input slot [X] for the input text x and an answer slot [Z] for the intermediately generated answer text z that will later be mapped into the final output y. Then, the input slot [X] is filled with the input text x [1].

It is important to note that in prompt engineering, prompts are not only in the form of natural language but can also be generated as continuous vector embeddings [22]; this is discussed in detail in Section III. By utilizing prompt engineering techniques, model performance and efficiency can be significantly improved, providing better support and assistance for downstream tasks.
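To make these two steps concrete, here is a minimal, self-contained Python sketch of a cloze-style prompt function for the sentiment example above; the template, verbalizer, and function names are our own illustration rather than code from any cited system.

    # A minimal sketch of a cloze-style prompt function (illustrative only).
    # The template contains an input slot [X] and an answer slot [Z].
    TEMPLATE = "[X] Overall, it was a [Z] movie."

    # The verbalizer maps intermediate answer words z to final labels y.
    VERBALIZER = {"fantastic": "positive", "great": "positive",
                  "boring": "negative", "terrible": "negative"}

    def fill_prompt(x: str) -> str:
        """Apply the template and fill the input slot [X] with the input x."""
        return TEMPLATE.replace("[X]", x)

    def map_answer(z: str) -> str:
        """Map the answer text z (predicted for slot [Z]) to the output y."""
        return VERBALIZER.get(z, "unknown")

    prompt = fill_prompt("I love this movie.")
    # A masked language model would now predict a word for the [Z] slot;
    # here we simply assume it returned "fantastic".
    print(prompt)                   # I love this movie. Overall, it was a [Z] movie.
    print(map_answer("fantastic"))  # positive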
III. TYPES OF PROMPTS

In the realm of scientific research, prompts play a crucial role in guiding large models to produce accurate outputs. As shown in Figure 5, prompts are broadly categorized into two types: manual prompts and automated prompts [1]. Manual prompts are meticulously crafted by human experts to provide explicit instructions to the model on what type of data to focus on and how to approach the task at hand. These prompts are particularly effective when the input data is well-defined and the output must conform to a specific structure or format.

However, manual prompts have certain limitations that cannot be overlooked. Creating effective manual prompts requires significant expertise and time, and even minor changes to the prompts can have a great impact on model predictions [14]–[19]. This is especially true for complex tasks, where providing clear and effective prompts is a challenging endeavor. To address these limitations, researchers have developed various automated methods to design prompts.

Fig. 5. Prompt engineering can be broadly categorized into two types: manual prompts and automated prompts. Manual prompts encompass zero-shot prompting and few-shot prompting, both of which rely on human expertise for their manual configuration. On the other hand, automated prompts consist of discrete prompting and continuous prompting, which involve the design of a series of automatic algorithms. Discrete prompts are typically human-interpretable, whereas continuous prompting usually employs learnable tokens that are interpretable by computers.

Automated prompts have gained popularity due to their efficiency and adaptability. These prompts are generated using various algorithms and techniques, eliminating the need for human intervention [6], [20]–[22]. Discrete prompts and continuous prompts are two common types of automated prompts. Discrete prompts rely on predefined categories to generate responses, while continuous prompts consider the current conversation context to produce accurate outputs [1], [32]–[35]. Additionally, there are static prompts and dynamic prompts, which vary in their consideration of historical context [36], [37].

A. Manual Prompts

Manual prompts can be used to guide models towards producing the desired results for specific tasks. These prompts are often created based on human expertise and are commonly used with LLMs to improve their performance on downstream tasks. For instance, the LAMA dataset introduced hand-crafted cloze prompts to explore the knowledge of large models [16]. Similarly, prefix prompts can be manually created to facilitate common-sense reasoning tasks. In chat-oriented models like ChatGPT, zero-shot prompting is used to guide the models towards producing ideal results for downstream tasks. Additionally, it is common to include prompt examples that match the level of task complexity to further guide models in producing accurate outputs. In this section, we mainly discuss few-shot prompting that is manually created through instruction, with examples selected by human experience rather than automatically. As AI technology rapidly advances, automated few-shot prompting techniques that select examples automatically are also emerging; these are discussed in Section III-B.
1) Zero-shot Prompting: The superior performance of LLMs in a wide range of NLP tasks has garnered significant attention due to their remarkable ability to learn from context, making them an attractive tool for various research problems. This emerging capability is often enhanced through prompt engineering to improve the performance of LLMs in downstream tasks. Consequently, researchers have proposed various methods for designing prompts to guide the models' applications in different research questions [38].

Recently, the development of LLMs such as GPT-3 and ChatGPT has made the use of prompt-based methods increasingly popular for various downstream tasks. Zero-shot prompting, in particular, has shown great promise: providing a well-designed prompt alone, without corresponding examples, can lead to outstanding results. Some studies have suggested that leveraging the powerful performance of GPT can help complete downstream tasks such as extraction and recognition, and can even outperform some full-shot models on certain datasets [17]. In the medical field, impressive achievements from exploiting the strong contextual capabilities of LLMs have also been observed. For instance, DeID-GPT proposes the use of high-quality prompts to preserve privacy and summarize key information in medical data with ChatGPT and GPT-4, achieving better results compared to several baseline methods; the study provides a foundation for future research on applying medical data to LLMs [14]. ChatAug proposes a prompt-based medical data augmentation method on ChatGPT, which outperforms other popular methods; the study highlights the importance of domain knowledge in generating correct augmented data and suggests a fine-tuning approach for future research [15]. Recent work has investigated the feasibility of using manual prompts to guide downstream tasks on ChatGPT and has shown that ChatGPT's performance on translation tasks can be significantly improved with clear and accurate prompts [18]. Similarly, research on ChatGPT's zero-shot prompting has yielded similar conclusions [19]. Another study presents HealthPrompt, a prompt-based clinical NLP framework that explores different types of prompt templates, including prefix prompts and cloze prompts, to improve zero-shot learning performance on clinical text classification tasks without requiring additional training data [32]. These findings highlight the potential of ChatGPT to enhance model performance on NLP tasks through effective prompt design, providing valuable insights for future research in this area.

Overall, the success of prompt-based methods has shown their great potential for further advancements, and their significance is likely to continue to grow in the future.

2) Few-shot Prompting: While zero-shot prompting has shown impressive performance in many tasks, it still has some limitations. First, it relies heavily on the performance of the pre-trained models. Second, due to the flexibility of zero-shot prompting, its outputs may not always be accurate. In such cases, providing a small number of prompt examples as guidance for the model has become a useful approach to achieving better performance. Few-shot prompting can serve as a clear and explicit prompt input to guide the model towards desired outputs. For instance, research has been conducted to measure the baseline performance of GPT-4 on medical multiple-choice questions (MCQs) using only a few prompt samples, without the need for complex methods such as chain-of-thought [39].

These prompt methods rely on the contextual emergence ability of LLMs, which has demonstrated impressive performance on ChatGPT/GPT-4 [40]. The manual prompt-based approach has shown typical few-shot prompting capabilities in medical text translation, text augmentation, generation, summarization, and other tasks, with significant improvements over baseline models on public datasets [19], [41].
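As an illustration of the difference between the two manual prompting styles, the sketch below builds a zero-shot and a few-shot prompt for a medical multiple-choice question; the wording, the example questions, and the query_llm stub are our own assumptions rather than prompts from the cited studies.

    # Illustrative zero-shot vs. few-shot prompts for a medical MCQ task.
    # query_llm is a hypothetical stand-in for any chat/completion API.
    def query_llm(prompt: str) -> str:
        raise NotImplementedError("plug in an LLM API of your choice")

    question = ("Which imaging modality is preferred for initial evaluation "
                "of suspected acute ischemic stroke? "
                "A) Ultrasound B) Non-contrast CT C) PET D) Fluoroscopy")

    # Zero-shot: instruction only, no worked examples.
    zero_shot = f"Answer the medical multiple-choice question with A-D.\n{question}"

    # Few-shot: prepend a small number of hand-picked worked examples.
    examples = ("Q: Which view is standard for a screening chest radiograph? "
                "A) Lateral B) PA C) Oblique D) Lordotic\nAnswer: B\n\n")
    few_shot = ("Answer the medical multiple-choice question with A-D.\n\n"
                f"{examples}Q: {question}\nAnswer:")
    # answer = query_llm(few_shot)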
B. Automated Prompts

The effectiveness of human-designed prompts depends on the prompt designer's expertise in selecting relevant information and constructing prompts in a way that is understandable to the model. Human prompts are particularly effective for tasks with well-defined input data and structured output requirements. However, designing effective human prompts requires a significant amount of time and domain-specific knowledge, making their usage in more complex tasks challenging. Therefore, researchers are exploring automated prompt design methods to address these challenges and improve the efficiency and adaptability of prompt-based approaches.

1) Discrete Prompting: Discrete prompts refer to automatically searching for templates in a discrete space, typically corresponding to natural language phrases, to provide the model with guidance for generating the desired output. This approach can make prompt design easier and improve the efficiency and adaptability of the model [1]. Common discrete prompt construction methods include prompt mining, prompt paraphrasing, prompt generation, and prompt scoring [42].

Prompt mining is a method of automatically discovering templates or prompts by scraping a large text corpus for frequent middle words or dependency paths between a set of input-output pairs. These prompts can be used for various natural language processing tasks such as language modeling, text classification, and question answering. One study proposes a prompt-based learning method that automatically identifies and adds prompt phrases to a prompt template for fine-grained detection of Alzheimer's disease; the method enhances detection accuracy through the use of the additional prompts [43].

Prompt paraphrasing involves generating a set of candidate prompts from an existing seed prompt, using methods such as round-trip translation or phrase replacement, and selecting the one that achieves the highest training accuracy for a target task. This approach can be optimized using a neural prompt rewriter and can be applied on an individual-input basis. For instance, one article proposes a simple yet effective prompt method based on paraphrasing to assist pre-trained models in learning rare biomedical vocabulary, resulting in improved performance in biomedical applications [44]–[46]. The STREAM framework leverages prompt-based language models to generate task-specific logical rules for named entity tagging, reducing the need for human labor. By utilizing existing prompt templates and continuously teaching the language model, STREAM achieves higher accuracy and efficiency in the tagging process [47].

Prompt generation is the task of generating prompts using natural language generation models. One study proposes MEDIMP, a framework that generates medical prompts using automatic textual data augmentations from LLMs, and demonstrates its superiority in producing high-quality prompts for medical applications [33]. Another paper proposes a method that utilizes prompts filled with bounding box annotations to generate descriptions containing extensive hints and context for instance recognition and localization; the knowledge from language is then distilled into the detection model by maximizing cross-modal mutual information at both the image and object levels [34]. This approach involves training prompts from existing data and using these prompts to generate new data, which can then be used to define additional prompts, ultimately providing downstream tasks with higher-quality feature representations [35]. A further study introduces a prompt construction module that leverages medical language templates to enable pre-trained language models to extract contextual information from tabular electronic health records and generate more comprehensive cell embeddings, which can be used by a pre-trained sentence encoder to generate sentence embeddings as cell representations [48].

Prompt scoring involves using language models to score filled prompts for a given task and selecting the one with the highest probability. This can result in a custom template for each individual input, as in the case of knowledge base completion. For instance, ImpressionGPT proposes dynamic prompts, which select the top-k examples with the highest similarity to each input; based on the probability and threshold of the k selected examples, the prompt for each input is determined. This approach, which utilizes few-shot learning, has shown better performance than some full-shot learning methods [49].

The utilization of various prompt types can greatly benefit language models in their performance of a diverse range of tasks, such as natural language understanding, text generation, machine translation, and question answering. Each type of prompt, whether produced through prompt mining, prompt paraphrasing, prompt generation, or prompt scoring, can be used independently or in combination to optimize the performance of the language model for a particular task. This underscores the significance of prompt-based approaches in advancing the capabilities of NLP systems.
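A minimal sketch of the dynamic-prompt idea is shown below: for each input, the k most similar labeled examples are retrieved and assembled into a few-shot prompt. We use difflib's string similarity purely as a runnable stand-in for the embedding-based similarity that systems such as ImpressionGPT use in practice, and the toy corpus is invented for illustration.

    import difflib

    # Toy labeled corpus of (findings, impression) pairs; invented data.
    CORPUS = [
        ("Findings: clear lungs, normal heart size.", "No acute disease."),
        ("Findings: right lower lobe opacity.", "Right lower lobe pneumonia."),
        ("Findings: enlarged cardiac silhouette.", "Cardiomegaly."),
    ]

    def top_k_examples(report: str, k: int = 2):
        """Score every stored example against the input and keep the top k."""
        scored = [(difflib.SequenceMatcher(None, report, src).ratio(), src, tgt)
                  for src, tgt in CORPUS]
        return sorted(scored, reverse=True)[:k]

    def build_dynamic_prompt(report: str) -> str:
        """Assemble a few-shot prompt from the most similar examples."""
        shots = "".join(f"Report: {s}\nImpression: {t}\n\n"
                        for _, s, t in top_k_examples(report))
        return f"{shots}Report: {report}\nImpression:"

    print(build_dynamic_prompt("Findings: opacity in the right lower lobe."))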
2) Continuous Prompting: In recent years, the development of LLMs has made prompt-based methods increasingly popular in various downstream tasks. In this context, continuous prompting (e.g., soft prompting) has emerged as a new type of prompt and has gained widespread attention from researchers. Unlike discrete prompts, continuous prompts operate in the embedding space and are no longer limited to text-readable forms. Additionally, continuous prompts can optimize their parameters through tuning on training data for downstream tasks, which relaxes the constraints on prompts and improves the efficiency of LLMs in task execution. Meanwhile, continuous prompting has been shown to be an effective alternative to standard model fine-tuning in cases where the data is highly imbalanced [50]. Discrete prompts are composed of discrete words or phrases, typically in human-interpretable natural language, to guide the model in performing downstream tasks. Continuous prompts, on the other hand, prompt the model directly in the embedding space, without constraints on natural language, and allow the prompts to have their own parameters that can be fine-tuned using training data.

In the medical field, the advantages of continuous prompting have been further demonstrated. Liang et al. use continuous prompts to introduce prompting into the model architecture as the only trainable parameters to guide model output, proposing PromptFuse and BlindPrompt as methods for aligning different modalities in a modular and parameter-efficient manner. These methods require only a few trainable parameters and perform comparably to several multi-modal fusion methods in low-resource scenarios. The high modularity of prompting allows for the flexible addition of modalities at low cost, as it avoids the need to fine-tune large pre-trained models [36]. PrefixMol is a generative model that utilizes a learnable prompt token as a prefix embedding to guide the model to generate an output that meets the desired criteria. This prompt token encodes a set of learnable features and contextual information about the targeted pocket's circumstances and various properties. This approach offers significant advantages over traditional generative models, as it allows for more precise and efficient control over the output [37]. Wang et al. propose an adaptive PromptNet for glioma grading that utilizes only non-enhanced MRI data and receives constraints from features of contrast-enhanced MR data during training through a designed prompt loss. In the paper, enhanced features are used as prompts to guide the feature extraction of PromptNet, with an adaptive strategy designed to dynamically weight the prompt loss in a sample-based manner, thus achieving competitive glioma grading performance on NE-MRI data [51]. CPP utilizes task-specific prompts optimized with a contrastive prototypical loss to avoid semantic drift and prototype interference, using prompts as learnable tokens to train task-specific information for each task. It also introduces a multi-centroid prototype strategy to improve prototype representativeness. CPP outperforms existing methods under a light memory budget and improves class separation [52].

Overall, continuous prompting is a powerful tool for improving the performance of automated systems. It helps these systems learn and adapt more quickly and perform more accurately in a wide range of applications. The technique can be used in combination with other types of prompts, such as manual prompts and discrete prompts, to create a more comprehensive and effective learning environment, and it is also used in multi-modal tasks such as image and speech recognition to improve the accuracy and speed of systems. Additionally, continuous prompts can be tailored to the specific needs of the task, which makes them highly adaptable and versatile.
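The following PyTorch sketch shows the core mechanics of soft prompting under simplifying assumptions: a small block of trainable prompt vectors is prepended to the token embeddings of a frozen model, so only the prompt parameters receive gradients. The wrapper and the toy frozen network are our own; a real application would substitute a pre-trained transformer that accepts precomputed input embeddings.

    import torch
    import torch.nn as nn

    class SoftPromptWrapper(nn.Module):
        """Prepends trainable continuous prompt vectors to a frozen model's
        input embeddings; only the prompt parameters receive gradients."""

        def __init__(self, base_model: nn.Module, n_tokens: int, dim: int):
            super().__init__()
            self.base_model = base_model
            for p in self.base_model.parameters():
                p.requires_grad = False  # keep the pre-trained weights frozen
            # The continuous prompt: n_tokens learnable embedding vectors.
            self.soft_prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

        def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
            # input_embeds: (batch, seq_len, dim) token embeddings of the input.
            batch = input_embeds.size(0)
            prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
            return self.base_model(torch.cat([prompt, input_embeds], dim=1))

    # Toy frozen "model" for demonstration only; in practice this would be
    # a pre-trained transformer taking precomputed input embeddings.
    frozen = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
    model = SoftPromptWrapper(frozen, n_tokens=5, dim=16)
    out = model(torch.randn(4, 10, 16))  # output shape: (4, 15, 2)

During training, an optimizer is built only over the parameters with requires_grad set to True, so tuning updates just the continuous prompt while the pre-trained weights stay fixed.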
C. Comparison of Manual and Automated Prompts

Manual prompts refer to human-crafted prompts, while automated prompts are generated by algorithms or automated methods. The primary difference between manual and automated prompts is the level of human involvement in the prompt construction process. Manual prompts are carefully designed by human experts, while automated prompts are generated using various search algorithms or other automated methods. The advantage of manual prompts is that they are typically well-designed and can capture important aspects of the downstream task. However, the manual prompt design process can be time-consuming and costly. On the other hand, automated prompts are generated more quickly and can potentially capture more task-specific information than manually crafted prompts. However, the effectiveness of automated prompts depends heavily on the quality of the search algorithms and the representational power of the language models.

In summary, both manual prompts and automated prompts have their advantages and disadvantages. Manual prompts provide greater control over the output, while automated prompts are more efficient and adaptable. Ultimately, the choice of prompt will depend on the specific task and the available resources.

IV. APPLICATIONS OF PROMPTS

Prompt engineering enables language models to achieve a wide range of AI tasks in the medical field with high performance. With customized medical prompts, models can tackle complex healthcare problems that were previously difficult to address without huge medical datasets and resources. In the medical field, the applications of prompt engineering include the following (one illustrative template per task family is sketched in code after the list):

• Classification [19], [32], [43], [50], [53]–[57]: Medical prompts elicit predicted diagnoses, categories, or tags from models. For example, "The patient's CT scan results show: [tumor type]" can steer a model to generate "thymoma", etc. Classification prompts provide context for model predictions in the medical domain. They are useful for condition classification, diagnosis classification, etc.

• Generation [18], [38], [58]–[63]: Open-ended medical prompts give models the freedom to generate creative samples, such as case reports and draft research papers, from their knowledge. For instance, "Here is a possible case report:" can lead to an original case report. Generation prompts allow unconstrained production of new medical data samples. They are applicable to creative medical tasks like clinical case modeling and medical article writing.

• Detection [14], [64], [65]: Medical prompts frame object or anomaly detection contexts and ask models to report the entities detected. For example, "Detect any abnormal findings you observe in the attached CT images". Detection prompts specify what medical models need to identify. They are relevant for locating and classifying diseases, anatomical anomalies, or other irregularities in medical images, videos, and documents.

• Augmentation [15], [33], [41], [66]: Medical prompts provide additional context to augment data for training language models. For example, we can input samples of all classes into language models like ChatGPT and prompt the model to generate samples that preserve semantic consistency with the existing labelled data. The generated data, together with the original samples, are used to train classifier models. Augmentation prompts add details to expand medical datasets. They are useful for boosting model performance on certain medical tasks by synthesizing more data context.

• Question Answering [40], [67]–[73]: Medical prompts pose questions for models to comprehend and accurately respond to based on medical knowledge. For example, "What year was radiocontrast invented?" prompts the model to search its medical knowledge base and answer "1923". Question answering prompts assess whether medical models understand questions and can generate correct responses by finding relevant medical information or facts.

• Inference [44], [74], [75]: Medical prompts present scenarios and have models explain their clinical reasoning or diagnostic process. For example, "Here are two medical variables: A = [some values], B = [some values]. Explain how A affects B and why." The model must logically relate A and B based on medical knowledge. Reasoning prompts require coherent medical explanations and inference generation. They aim to test the clinical logic, causal reasoning, and argumentation abilities of medical models.

In summary, prompt engineering provides a framework for adapting and improving language models for diverse medical applications. Prompt design is the key to unlocking model capabilities for different medical problems.
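The sketch below collects one illustrative template per task family from the list above; the exact wording is ours, composed for demonstration rather than drawn from any cited system.

    # Illustrative prompt patterns for the task families above (our wording).
    PROMPT_PATTERNS = {
        "classification": "The patient's CT scan results show: {finding}. Diagnosis:",
        "generation":     "Here is a possible case report for {condition}:",
        "detection":      "Detect any abnormal findings you observe in: {report}",
        "augmentation":   "Rephrase, preserving clinical meaning: {sample}",
        "qa":             "Question: {question}\nAnswer concisely:",
        "inference":      "Variables: A = {a}, B = {b}. Explain how A affects B and why.",
    }

    def make_prompt(task: str, **slots) -> str:
        """Fill the slots of the chosen task pattern with concrete values."""
        return PROMPT_PATTERNS[task].format(**slots)

    print(make_prompt("qa", question="What year was radiocontrast invented?"))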
A. Classification Task

Prompt engineering techniques show great promise for medical classification tasks when annotated data is scarce. Several studies apply prompt-based learning with pre-trained language models to clinical and mental health classification, as shown in Table I.

1) Medical image analysis: GPT4MIA utilizes the Generative Pre-trained Transformer (GPT) as a plug-and-play inference model for medical image analysis (MIA) classification tasks [53]. The authors theoretically justify using GPT-3, a large pre-trained language model, for MIA. They develop prompt engineering techniques such as improved prompt structure design, sample selection, and prompt ordering to enhance GPT4MIA's efficiency and effectiveness in detecting prediction errors and improving accuracy for image classification. GPT4MIA, working with established vision models, shows strong performance on these tasks.

2) Clinical text classification: HealthPrompt proposes a novel prompt-based clinical NLP framework that applies prompt engineering to pre-trained language models for classifying clinical texts without training data [32]. Deep learning models need large annotated datasets, but these are lacking for clinical NLP. HealthPrompt addresses this by using prompt-based learning to tune pre-trained language models for new tasks by defining prompt templates instead of fine-tuning the models.
TABLE I
Applications of prompt engineering in classification tasks

Reference | Task | Prompt type | Dataset | Highlight
Yang et al. [76] | Multi-label classification | Discrete | US Veterans Health Administration Corporate Data Warehouse | Autoregressive model
Wang et al. [51] | Auxiliary glioma diagnosis | Continuous | BraTS | PromptNet
Kolluru et al. [77] | Text classification and generation | Discrete | PRONCI | MTGEN and UNIGEN models
Niu et al. [78] | Lung cancer diagnosis | Continuous | NLST, LUNA16, LUNG-PET-CT-Dx2 | CLIP and BioGPT
Sivarajkumar et al. [32] | Clinical text classification | Discrete | MIMIC-III | Defining a prompt template
Yang et al. [79] | Few-shot dialogue state tracking | Discrete | MultiWOZ 2.0 and 2.1 | SOLOIST as base model
Min et al. [80] | Few-shot text classification | Manual | SST-2, SST-5, MR, CR, Amazon, Yelp, TREC, AGNews, Yahoo, DBPedia, and Subj | -
Akrout [54] | Data augmentation, skin condition classification | Manual | LAION datasets | CLIP-based text encoder, U-Net-based latent space generator, and VAE-based image decoder
Zhu et al. [55] | Kidney stone classification | - | 1496 kidney stone images collected from 5 different videos | -
Lamichhane [19] | Mental health classification | Manual | Stress Detection, Depression Detection, and Suicidality Detection datasets | ChatGPT (GPT-3.5)
Deyoung et al. [81] | Disease inference | Discrete | Evidence Inference dataset | Fine-tuned BERT
Elfrink [50] | Early prediction of lung cancer | Continuous | Data from patients of general practitioners | Transformer-based pre-trained language models
Wang et al. [43] | Alzheimer's disease diagnosis | Discrete | ADReSS20 Challenge dataset | BERT and RoBERTa
Lin et al. [57] | Multiple instance learning | Continuous | Camelyon16, TCGA-NSCLC | -
Chen et al. [56] | Medical vision-and-language pre-training | Continuous | ROCO, MedICaT, MIMIC-CXR | -
Taylor et al. [82] | Clinically meaningful decision | - | MIMIC-III | -
Yao et al. [83] | Distinguishing disease symptoms | Manual | BioLAMA | -
Keicher et al. [84] | Diagnosis of disease | Continuous | MIMIC-CXR-JPG v2.0.0 | CLIP
Diao et al. [85] | Molecular property prediction | Continuous | BBBP, BACE, ClinTox, Tox21, SIDER, HIV, MUV, and ToxCast | Prompting function
Rao et al. [86] | Dense prediction | Discrete | ADE20K | Language-guided fine-tuning with context-aware prompting
Yao et al. [87] | Prompt probing | Manual | LMC-EHRs | BERT, RoBERTa, BioBERT, ClinicalBERT, 3 kinds of BioLMs, and 3 kinds of BlueBERTs
Zhang et al. [53] | Medical image classification, transductive inference | Manual | RetinaMNIST, FractureMNIST3D | GPT4MIA
Yang et al. [35] | Drug synergy classification | Discrete | - | BLIAM
Sung et al. [88] | Predict/retrieve biomedical knowledge | Manual | CTD, UMLS, Wikidata | BioLAMA

models. An analysis of HealthPrompt using six pre-trained images from clinical prompts [58]. By customizing Stable
language models in a no-data setting shows that prompts Diffusion for the medical domain through focusing fine-tuning
effectively capture the context in clinical texts and achieve and assessing performance with clinically tailored metrics, the
good performance without training data. authors translate domain knowledge into model capabilities
3) Mental health classification: ChatGPT, an LLM-based for sensitive generation where data is scarce. Their methods
model, demonstrates strong zero-shot performance in classi- and results highlight the promise of prompt engineering to
fying social media posts for three mental health tasks: stress impart specialized expertise to models for nuanced generation
detection, depression detection, and suicidality detection [19]. to address real-world challenges. However, fully realizing this
LLMs show promise for NLP mental health applications but potential will require continued progress in techniques for
need data, which ChatGPT addresses using prompt-based model adaptation and domain-specific evaluation.
learning. ChatGPT outperforms the baseline model, indicating 2) Medical text generation: NapSS presents a “summarize-
the potential of language models for mental health classifica- then-simplify” strategy to coherently simplify medical text
tion, especially when data is limited. [59]. NapSS generates summaries and narrative prompts to
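As a concrete, runnable illustration of prompt-based classification in this spirit, the sketch below scores verbalizer words with a masked language model through a cloze template; the template, label words, and the generic model name are our assumptions, not the exact configuration of the cited studies.

    from transformers import pipeline

    # A sketch of cloze-prompt classification: a masked LM scores label
    # words inside a hand-written template, with no task-specific training.
    # bert-base-uncased is a generic placeholder; clinical/biomedical
    # masked LMs can be substituted in the same way.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    note = "Patient reports low mood, poor sleep, and loss of interest."
    template = f"{note} The main concern of this text is [MASK]."

    # Restrict predictions to the verbalizer's label words.
    for pred in fill(template, targets=["depression", "stress"]):
        print(pred["token_str"], round(pred["score"], 4))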
B. Generation Task

1) Medical image generation: Chambon et al. demonstrate how selective fine-tuning and meaningful evaluation enable the Stable Diffusion generative model to generate medical images from clinical prompts [58]. By customizing Stable Diffusion for the medical domain through focused fine-tuning and assessing performance with clinically tailored metrics, the authors translate domain knowledge into model capabilities for sensitive generation where data is scarce. Their methods and results highlight the promise of prompt engineering for imparting specialized expertise to models for nuanced generation that addresses real-world challenges. However, fully realizing this potential will require continued progress in techniques for model adaptation and domain-specific evaluation.

2) Medical text generation: NapSS presents a "summarize-then-simplify" strategy to coherently simplify medical text [59]. NapSS generates summaries and narrative prompts to respectively train a model to extract key content and guide a generator to clarify it while preserving the flow of the text. Evaluated on medical text, NapSS improves on baselines in lexical, semantic, and human metrics. This highlights how multi-stage prompt engineering achieves nuanced, domain-relevant generation requiring sophisticated evaluation.
TABLE II
Applications of prompt engineering in generation tasks

Reference | Task | Prompt type | Dataset | Highlight
Wang et al. [60] | Medical text generation | Manual | MIMIC-CXR | ChatGPT
Chambon et al. [58] | Medical image generation | Discrete | CheXpert, MIMIC-CXR | Stable Diffusion model
Yang et al. [89] | Medical text generation | Discrete | MR, IMDB, SNLI | BiLSTM and BERT
Lu et al. [59] | Automated text simplification | Discrete | Cochrane paragraph-level medical text simplification dataset | BERT
Lehman et al. [61] | Medical question generation | - | DiSCQ, MIMIC-III | -
Zuccon et al. [72] | Medical text generation | Manual | TREC 2021 and 2022 | ChatGPT
Hertz et al. [90] | Image generation | Manual | - | Prompt-to-prompt editing framework
Liu et al. [91] | - | Discrete | ACE, ERE | -
Abaho et al. [38] | Health outcome generation | Discrete | EBM-NLP, EBM-COMET | Position-attention mechanism
Gao et al. [37] | Molecule generation | Continuous | CrossDocked | PrefixMol
Nori et al. [73] | Medical text generation | Manual | USMLE Sample Exam, USMLE Self Assessments, MedQA, PubMedQA, MedMCQA, MMLU | GPT-4
Kasai et al. [70] | Medical text generation | Manual | IGAKUQA dataset | ChatGPT, GPT-3, and GPT-4
Holmes et al. [40] | Medical text generation | Manual | 100-question multiple-choice exam | ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ
Chuang et al. [92] | Clinical note generation | Continuous | MIMIC-CXR | -
Wang et al. [93] | Question generation | - | SQuADv1, HotpotQA | -
Du et al. [94] | Natural language generation | Discrete | DSTC8 | -
Jang et al. [71] | Medical note generation | Manual | Korean National Licensing Examination | ChatGPT 3.5 and ChatGPT 4
Chen et al. [56] | Generation | Continuous | ROCO, MedICaT, MIMIC-CXR | Vision-language pre-trained model
Lyu et al. [18] | Clinical note generation | Manual | Atrium Health Wake Forest Baptist clinical database | ChatGPT, GPT-4
Yan et al. [95] | Entity clustering and entity set expansion | Discrete | eCoNLL2003, BC5CDR, WNUT 2017, WIKI | CLIP
Le et al. [96] | Text generation | Discrete | CHEMU-REF | GPT-J-6B
Peng et al. [97] | Clinical concept extraction and relation extraction | Manual | 2018 National NLP Clinical Challenges and 2022 n2c2 challenges | -

3) Medical report translation: Lyu et al. examine the LLM ChatGPT's performance in translating radiology reports into plain language for education [18]. Evaluated by radiologists, ChatGPT shows promise, with some inconsistencies that improve with better prompts. ChatGPT generates both general and specific recommendations based on the reports. GPT-4, a newer model, significantly outperforms ChatGPT, indicating LLMs' potential for clinical use but also the need for advanced models and prompts for optimized performance.

Further research and applications of prompt engineering in generation tasks are shown in Table II.

C. Detection Task

1) Medical image detection: Qin et al. show how prompt engineering enables vision-language models (VLMs) pretrained on natural images to transfer knowledge to medical images [64]. Manual prompts containing expert medical knowledge improve VLM performance on medical detection tasks with no medical training data. Automatic prompts inject image details, further enhancing performance. The proposed approaches outperform default prompts on 13 medical datasets, and fine-tuned models beat supervised baselines, demonstrating that VLMs can learn from limited medical data by transferring knowledge from natural images. This work establishes prompt engineering as a key to applying VLMs in medicine and paves the way for developing VLMs tailored to healthcare applications.

2) Medical text detection: DeID-GPT is the first framework to de-identify free-text medical data using GPT-4's state-of-the-art NLP capabilities, achieving the highest accuracy in removing private information while preserving its original meaning [14]. It requires no changes for different data types, thanks to GPT-4's scale and in-context learning. DeID-GPT uses prompts to generate optimal de-identification results with little human input, showing the potential of ChatGPT and GPT-4 for automated medical text processing. It contributes a novel and highly effective approach to an important problem: enabling the use of medical text data while protecting patient privacy. DeID-GPT establishes GPT-4 and similar LLMs as a means to overcome grand challenges in healthcare through prompt engineering and natural language understanding.

Further research and applications of prompt engineering in detection tasks are detailed in Table III.
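To illustrate the idea of prompt-driven de-identification, the sketch below wraps a clinical note in a redaction instruction; the instruction text and the chat stub are our own illustration, not the actual DeID-GPT prompt.

    # A sketch of prompt-based de-identification in the spirit of DeID-GPT.
    # `chat` is a hypothetical stand-in for a GPT-4-style chat API.
    def chat(prompt: str) -> str:
        raise NotImplementedError("plug in an LLM chat API here")

    DEID_INSTRUCTION = (
        "Replace all protected health information (names, dates, locations, "
        "IDs, contact details) in the following clinical note with the tag "
        "[REDACTED], changing nothing else:\n\n"
    )

    def deidentify(note: str) -> str:
        """Return the note with private information redacted by the LLM."""
        return chat(DEID_INSTRUCTION + note)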
TABLE III
Applications of prompt engineering in detection tasks

Reference | Task | Prompt type | Dataset | Highlight
Liu et al. [14] | Privacy protection | Manual | i2b2/UTHealth Challenge | ChatGPT and GPT-4
Feng et al. [34] | Multimodal knowledge learning | Discrete | MS-COCO, OpenImages Challenge 2019 | Multimodal knowledge learning
Zhou et al. [98] | Event argument extraction | Discrete | ACE2005, RAMS, WiKiEvents | -
Yao et al. [99] | Eviction status classification | Discrete | EHR notes | -
Ruan et al. [48] | Medical intervention duration estimation | Discrete | PASA dataset, MIMIC-III | Language-enhanced transformer-based framework
Qin et al. [64] | Medical image analysis | Discrete | ISIC 2016, DFUC 2020, CVC-300, CVC-ClinicDB, CVC-ColonDB, Kvasir, ETIS, BCCD, CPM-17, TBX11k, Luna16, ADNI, TN3k | -
Xing et al. [100] | Object detection | Discrete | EuroSAT, Caltech101, OxfordFlowers, Food101, FGVCAircraft, DTD, OxfordPets, StanfordCars, ImageNet1K, Sun397, and UCF101 | Dual-modality Prompt Tuning (DPT) paradigm
Ding et al. [101] | Multi-label recognition | Continuous | MS-COCO, PASCAL VOC, NUSWIDE | CLIP
Lin et al. [57] | Multiple instance learning | Continuous | Camelyon16, TCGA-NSCLC | -

TABLE IV
Applications of prompt engineering in augmentation tasks

Reference | Task | Prompt type | Dataset | Highlight
Dai et al. [15] | Medical text augmentation | Manual | PubMed20k dataset | ChatGPT
Milecki et al. [33] | Medical text augmentation | Discrete | DCE MRI, ID-RCB | ChatGPT
Lo et al. [102] | Mispronunciation detection | Discrete | MAS | -
Yuan et al. [41] | Privacy-aware data augmentation | Manual | Clinical trial data, patient EHR data | LLM-based patient-trial matching
Li et al. [66] | Image augmentation | Manual | CIFAR-10, CIFAR-100, Caltech10, Stanford Cars, Flowers102, OxfordPets, DTD | -
Yang et al. [103] | Few-shot semantic parsing | Discrete | GeoQuery and EcommerceQuery datasets | Filling in sequential prompts with LMs

D. Augmentation Task

1) Text data augmentation: Dai et al. propose a new text data augmentation method called ChatAug [15]. ChatAug uses the ChatGPT language model to rephrase each sentence in the training data into multiple conceptually similar but semantically different samples, enlarging the sample corpus. They apply ChatAug to few-shot text classification in the medical domain and show improved performance over state-of-the-art data augmentation methods in terms of both testing accuracy and the distribution of the generated samples.
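A minimal sketch of this augmentation loop is given below; the instruction wording and the chat stub are our own illustration of the idea, not the exact prompt used by ChatAug.

    # A sketch of ChatAug-style augmentation: ask a chat LLM to rephrase
    # each training sentence into several semantically varied alternatives.
    # `chat` is a hypothetical stand-in for an LLM chat-completion call.
    def chat(prompt: str) -> str:
        raise NotImplementedError("plug in ChatGPT or another chat LLM here")

    def augment(sentence: str, n: int = 5) -> list[str]:
        prompt = (f"Rephrase the following medical sentence into {n} "
                  f"conceptually similar but differently worded sentences, "
                  f"one per line, preserving the clinical meaning:\n{sentence}")
        return chat(prompt).splitlines()

    # augmented = augment("The patient denies chest pain or dyspnea.")
    # The classifier is then trained on the original plus augmented samples.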
2) Transform clinical data into prompts: Milecki et. al of medical QA datasets and human-centered evaluation in
propose a new mutil-modal learning model called MEDIMP - creating beneficial clinical language models. By systematically
Medical Images and Prompts [33]. MEDIMP translates clinical evaluating tuned language models, the authors reveal gaps in
biomedical data into text prompts which as input of LLMs comprehending medical knowledge and reasoning that must
to produce augmented data. Contrastive learning and image- be addressed to develop models for healthcare.
text pairs are then used to learn meaningful representations 2) Medical image-based QA: The open-ended medical vi-
of medical images, i.e., renal transplant DCE MRI images. sual QA work treats the task as a generative process. Sonsbeek
MEDIMP generates prompts using predefined sentence tem- et. al develop a network to map visual features to tokens
plates and ChatGPT, aiming to learn useful representations for that, alone with the question, directly prompt a PLM [68].
prognosis of patient status 2-4 years after transplant. Exploring PLM fine-tuning strategies, their approach generates
Further research and applications of prompt engineering in open-ended responses that outperform others on medical QA
augmentation tasks are shown in Table IV. benchmarks like Slake, OVQA and PathVQA. This enables a
PLM to understand medical images for open-ended medical
visual QA tasks where answers are not constrained to a
E. Question Answering Task predefined set. The work points to promising directions for
1) Medical text-based QA: Singhal et. al present medical open-domain medical visual question answering using prompt-
QA datasets, MultiMedQA and HealthSearchQA, and propose based generative models.
a framework for evaluating clinical LLMs along dimensions 3) Medical video-based QA: Visual-Prompt Text Span Lo-
like factuality, precision, harm and bias [67]. To adapt LLMs calization (VPTSL) proposes a novel approach to the temporal
like PaLM and FlanPaLM for the medical domain, the authors answering grounding in video (TAGV) task by formulating it
apply prompt engineering techniques like instruction prompt as predicting the span of timestamped subtitles that matches
tuning, a parameter-efficient prompting approach for aligning the visual answer [69]. To bridge the semantic gap between
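To make the prefix-style visual prompting described above concrete, the following minimal sketch maps frozen image features to a handful of soft prompt tokens and prepends them to the question embeddings of a causal PLM. It is an illustration under assumed names and dimensions (the linear mapper, the 512-dimensional vision features, and GPT-2 as a stand-in PLM), not the released implementation of [68].

```python
# Sketch of prefix-style visual prompting for open-ended medical VQA, in the
# spirit of [68]. Module names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in PLM
plm = AutoModelForCausalLM.from_pretrained("gpt2")
emb_dim = plm.get_input_embeddings().embedding_dim

# Mapping network: projects image features to a few "visual prompt" tokens.
n_prompt_tokens = 8
mapper = nn.Linear(512, n_prompt_tokens * emb_dim)     # 512 = assumed vision-encoder width

image_feats = torch.randn(1, 512)                      # placeholder CLIP-style features
visual_prompt = mapper(image_feats).view(1, n_prompt_tokens, emb_dim)

question = "What abnormality is visible in this chest X-ray?"
q_ids = tok(question, return_tensors="pt").input_ids
q_embeds = plm.get_input_embeddings()(q_ids)

# The visual prompt tokens are simply prepended to the question embeddings;
# the PLM then produces a free-text answer conditioned on both.
inputs_embeds = torch.cat([visual_prompt, q_embeds], dim=1)
logits = plm(inputs_embeds=inputs_embeds).logits
print(tok.decode([logits[0, -1].argmax(-1).item()]))   # greedy decoding, one step
```

In such a system the mapping network is trained on image-question-answer triples, while how much of the PLM to fine-tune is itself a design choice explored in [68].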

TABLE V
APPLICATIONS OF PROMPT ENGINEERING IN QA TASKS

Reference | Task | Prompt type | Dataset | Highlight
Liu et al. [104] | Visual question answering | Discrete | GQA | —
Liang et al. [36] | Visual question answering | Continuous | VQAv2 | PromptFuse and BlindPrompt
Kasai et al. [70] | Answering medical questions | Manual | IGAKUQA dataset | ChatGPT, GPT-3, and GPT-4
Holmes et al. [40] | Answering medical questions | Manual | 100-question multiple-choice | ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ
Liu et al. [105] | Question answering | Discrete | MIT Movie, MIT Restaurant, CoNLL03 | —
Jang et al. [71] | Answering medical questions | Discrete | Korean National Licensing Examination | ChatGPT3.5, ChatGPT4
Li et al. [69] | Medical instructional video QA | Continuous | MedVidQA | —
Lee et al. [63] | Answering medical questions | Manual | Diabetes EHR dataset | Clinical Decision Transformer

3) Medical video-based QA: Visual-Prompt Text Span Localization (VPTSL) proposes a novel approach to the temporal answering grounding in video (TAGV) task by formulating it as predicting the span of timestamped subtitles that matches the visual answer [69]. To bridge the semantic gap between textual questions and visual answers, VPTSL introduces visual prompts: highlight features obtained from video-text highlighting using the question. These visual prompts are fed into the PLM along with the subtitles and question. A text span predictor then models these visual and textual prompts to predict the subtitle span. VPTSL outperforms the state-of-the-art on MedVidQA by a large margin (28.36% in mIoU), showing the effectiveness of visual prompts and the text span predictor for medical video QA. This work enables PLMs to ground temporal answers in instructional videos by prompting them with contextual visual information.

Further research and applications of prompt engineering in question answering tasks are detailed in Table V.

F. Inference Task

1) Natural language inference in radiology: Wu et al. evaluate ChatGPT and GPT-4 on a radiology natural language inference task [74]. They implement zero-shot and few-shot prompts in the models. The zero-shot prompt provides only task instructions and sentence-question pairs, requiring the models to determine entailment with no labeled examples. The few-shot prompt incorporates 10 labeled sentence-question examples to provide in-context learning before the evaluation pair. By designing these prompts with varying levels of contextual information, the authors show that ChatGPT and GPT-4 can achieve over 50% accuracy on the radiology task without requiring large amounts of training data. GPT-4 outperforms ChatGPT, indicating its greater capability. The results demonstrate the feasibility of building generic models that can perform well across different domains; a minimal sketch of the two prompt formats is given just before Table VI.

2) Causal reasoning about medical variables: Long et al. examine whether GPT-3 can accurately predict the presence or absence of edges between variables in causal graphs based on medical context [75]. They evaluate GPT-3 using different prompts (e.g., declarative vs. interrogative) and linking verbs (e.g., "causes" vs. "correlates with"). They find GPT-3's performance varies based on these factors, indicating its sensitivity to user input. With well-designed prompts using "causes", GPT-3 achieves over 50% accuracy on the tested graphs. Further, the authors select three causal graphs of different complexities from the medical literature. For each graph, they generate 2,000 randomly permuted sentence pairs from the graph. GPT-3 predicts whether a causal edge exists between each pair. Its accuracy is highest (70%-85%) on the simplest graph. Though performance decreases for the complex graphs, it remains well above the 50% random baseline, demonstrating GPT-3's potential for reasoning about medical variables; a sketch of this edge-probing setup follows Table VI.

Further research and applications of prompt engineering in inference tasks are shown in Table VI.
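Before turning to Table VI, the sketch below shows how zero-shot and few-shot prompts of the kind evaluated in [74] can be assembled. The instruction wording and label set are assumptions for illustration, not the authors' exact prompts.

```python
# Hypothetical prompt templates mirroring the zero-shot and few-shot designs
# described for the radiology NLI task above; the wording is illustrative only.
INSTRUCTION = (
    "Determine whether the hypothesis logically follows from the radiology "
    "report sentence. Answer 'entailment', 'contradiction', or 'neutral'."
)

def zero_shot_prompt(premise: str, hypothesis: str) -> str:
    # Task instructions plus the evaluation pair only, with no labeled examples.
    return f"{INSTRUCTION}\nSentence: {premise}\nHypothesis: {hypothesis}\nAnswer:"

def few_shot_prompt(premise: str, hypothesis: str, examples) -> str:
    # 'examples' holds (premise, hypothesis, label) triples that provide
    # in-context learning before the evaluation pair (10 in the study above).
    shots = "\n".join(
        f"Sentence: {p}\nHypothesis: {h}\nAnswer: {y}" for p, h, y in examples
    )
    return f"{INSTRUCTION}\n{shots}\nSentence: {premise}\nHypothesis: {hypothesis}\nAnswer:"

demo = [("No focal consolidation is seen.", "The lungs are clear.", "entailment")]
print(few_shot_prompt("Mild cardiomegaly is present.", "The heart size is normal.", demo))
```

The only difference between the two designs is the block of labeled examples; everything else, including the instruction and the final unanswered pair, stays fixed.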

TABLE VI
APPLICATIONS OF PROMPT ENGINEERING IN INFERENCE TASKS

Reference | Task | Prompt type | Dataset | Highlight
Wang et al. [44] | Biomedical natural language inference | Discrete | MedNLI, MedSTS | —
Long et al. [75] | Building causal graphs | Manual | — | GPT-3
Liévin et al. [106] | Medical natural language inference | Manual | USMLE, MedMCQA, PubMedQA | GPT-3.5
Kim et al. [107] | Natural language interaction | Discrete | Extractive and multiple-choice QA | GPT-3
Li et al. [108] | Arithmetic reasoning, commonsense reasoning, inductive reasoning | Discrete | GSM8K, ASDiv, MultiArith, SVAMP, SingleEq, CommonsenseQA, StrategyQA, CLUTRR | Davinci, text-davinci-002 and code-davinci-002
Gao et al. [109] | Mathematical problems, symbolic reasoning, algorithmic problems | Discrete | GSM8K, SVAMP, ASDiv, MAWPS, BIG-Bench Hard | Program-Aided Language models
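As a concrete illustration of the edge-probing setup attributed to [75], the sketch below generates declarative prompts with the linking verb "causes" for every ordered variable pair and scores yes/no answers against a reference graph. The tiny graph and the query_llm placeholder are hypothetical; a real experiment would call an actual LLM API.

```python
# Hypothetical causal edge probing in the style of [75]. The graph, variable
# names, and query_llm placeholder are illustrative assumptions.
from itertools import permutations

graph = {("smoking", "lung cancer"), ("asbestos exposure", "lung cancer")}
variables = {"smoking", "asbestos exposure", "lung cancer"}

def make_prompt(a: str, b: str, verb: str = "causes") -> str:
    # Declarative phrasing; an interrogative variant would be
    # f"Does {a} cause {b}? Answer yes or no."
    return f"Answer yes or no: {a} {verb} {b}."

def query_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM API call")

def edge_accuracy(predict=query_llm) -> float:
    # Compare the model's yes/no answer with the reference graph for
    # every ordered pair of distinct variables.
    pairs = list(permutations(sorted(variables), 2))
    correct = sum(
        predict(make_prompt(a, b)).strip().lower().startswith("yes") == ((a, b) in graph)
        for a, b in pairs
    )
    return correct / len(pairs)

# A trivial always-"no" baseline exercises the scoring loop: 4 of the 6
# ordered pairs carry no edge here, so it scores about 0.67.
print(edge_accuracy(lambda prompt: "no"))
```

Probing prompt sensitivity then amounts to re-running the same loop with a different template or linking verb, exactly the comparison reported above.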

G. A Recap of Prompt Engineering for Classification, Generation, Detection, Augmentation, QA, and Inference

These studies highlight the potential of using prompt engineering for pre-trained language models to advance clinical NLP and a range of medical AI applications, especially when annotated data is scarce. Prompt-based learning tunes models for new tasks by defining task templates instead of fine-tuning, helping models achieve strong performance in zero-shot learning scenarios where they generalize to new classes without examples. Large pre-trained language models provide an ideal initialization for prompt engineering across medical domains and tasks when combined with effective prompt design.

Though more work is needed, these papers demonstrate the promise of prompt engineering and LLMs for medical AI with limited data. Prompt-based learning could help address the lack of annotated data and advance applications like classification, generation, detection, augmentation, question answering, and reasoning by unlocking models' capabilities through tailored prompts. Strong zero-shot performance suggests prompt engineering may reduce data needs for medical AI versus traditional supervised learning.

V. CHALLENGES AND FUTURE DIRECTIONS

This section discusses the challenges, current research directions, and future opportunities and development directions of prompt-based methods.

Firstly, the challenges in prompt engineering are addressed, including data scarcity [110] in the medical NLP domain, the interpretability of models, and inherent issues in prompt engineering.

Secondly, the current research directions are introduced, which include prompt generation [1], [111], prompt optimization [49], multi-modal data processing [60], and deep reinforcement learning [112]. These research directions aim to improve the effectiveness and applicability of prompt-based methods, further promoting the development of the NLP field.

Lastly, the opportunities and future development directions of prompt-based methods are explored, such as the development of multitask and multi-modal processing and the expansion of innovative prompt design and intelligent prompt generation [2], [49]. These opportunities and development directions provide a broad space and prospects for the future development of prompt-based methods.

A. Challenges in Prompt Engineering

When it comes to NLP tasks in the medical field, prompt engineering faces several challenges. First, medical data is usually limited and specialized, making it difficult to cover all domain knowledge and thus restricting the model's generalization ability [14], [60]. Secondly, the medical field is rich in terminology and domain knowledge that is often complex, which requires effective integration of this knowledge into prompts [12]. Additionally, different medical tasks require different prompt designs, which requires balancing the complexity and interpretability of prompts while also utilizing existing medical domain knowledge to guide the model to generate high-quality predictions [67]. Finally, the prompt design also needs to consider how to avoid introducing any human bias or erroneous information to ensure the fairness and accuracy of the model [113].

It should be noted that NLP tasks in the medical field are highly challenging as they involve a large amount of domain knowledge and specialized terminology, which may not be easily understood or processed by general NLP models [114]. Therefore, the importance of prompt engineering in the medical field is self-evident. These challenges include, but are not limited to:

Data scarcity: In the medical field, many tasks require the use of specific medical data, which is often very scarce and difficult to obtain [14], [83]. Moreover, the special nature of medical data involves ethical issues, which poses a great challenge to prompt engineering. The lack of sufficient data makes it difficult to design accurate and effective prompts.

Data uncertainty: The medical field involves extensive domain knowledge and terminology, which may have different interpretations and usage in different texts, leading to increased uncertainty in prompt design [72], [83].

Model interpretability: In the medical field, model interpretability is particularly important. As it involves human health and life, the model's prediction results need to be reasonably explained and justified. Therefore, in the medical field, prompt design must consider not only the accuracy of the model but also its interpretability [14].

Additionally, prompt engineering in the medical field faces challenges common to other fields, such as prompt engineering's self-consistency issues [115], prompt leakage [116], and adversarial issues [117], among others. These challenges are more severe in medical tasks than in other fields. Future research needs to fully consider these challenges and propose corresponding solutions.

In summary, prompt engineering in the medical field is critical to the success of NLP tasks, but it also faces various challenges, such as data scarcity [118], [119] and uncertainty [120], model interpretability [121], [122], self-consistency [123], and adversarial issues [124]. Addressing these challenges requires innovative solutions and a comprehensive understanding of medical domain knowledge and its related terminology.

B. Current Research Directions

As the development and application of artificial general intelligence (AGI) models continues to advance, researchers in the medical field have also begun to apply them to various medical tasks. Currently, the research directions of prompt engineering in the medical field are very diverse, covering different types of prompt designs [60], multi-modal data processing [57], and deep reinforcement learning [125], among other aspects. Before surveying these directions in detail, the sketch below contrasts the two broad prompt families that recur throughout this review.
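The sketch contrasts a manually designed (discrete) cloze template with a continuous prompt embedded as trainable parameters. The class, shapes, and example text are assumptions for illustration, not any particular published system.

```python
# Discrete vs. continuous prompts, PyTorch-style; names and shapes are assumed.
import torch
import torch.nn as nn

# Discrete: a fixed natural-language template; only the slot content changes.
def discrete_prompt(note: str) -> str:
    return f"Clinical note: {note} The diagnosis mentioned in this note is [MASK]."

# Continuous: learned "soft" tokens prepended to the input embeddings and
# optimized by gradient descent while the language model itself stays frozen.
class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int = 16, emb_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, emb_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        return torch.cat([self.prompt.expand(batch, -1, -1), input_embeds], dim=1)

soft = SoftPrompt()
dummy_input = torch.randn(2, 32, 768)           # (batch, seq_len, emb_dim)
print(discrete_prompt("Patient reports chest pain."))
print(soft(dummy_input).shape)                  # torch.Size([2, 48, 768])
```

Discrete templates are human-readable and easy to audit, which matters for interpretability in clinical settings, while continuous prompts trade readability for task-specific optimization.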

In terms of prompt design methods, it is necessary to consider the specific characteristics of the task and data features to make appropriate choices. These include simple manually designed template-style prompts, prompts based on knowledge graphs [126], prompts generated based on natural language [127], and prompts that can be embedded as trainable parameters in the model for automatic generation [78]. In multi-modal data processing, integrating text and image data requires the design of appropriate prompts to guide the model to handle and analyze different types of data, thereby improving the model's performance and interpretability [128]. Additionally, deep reinforcement learning [129] can continuously optimize prompt design through autonomous learning, further improving the performance of the model.

In summary, the current research directions of prompt engineering in the medical field are diverse, and researchers can choose appropriate methods and techniques based on the requirements of the task and data characteristics to improve the performance and interpretability of the model.

C. Opportunities and Future Directions

In the field of medicine, prompt engineering presents a wide range of opportunities and future directions for application. The development of LLMs such as ChatGPT and GPT-4 provides tremendous potential for prompt engineering in the medical field [130]. As an essential part of prompt-based technology, prompt engineering can assist large models in understanding and processing medical data more accurately [131]; moving from manual prompts to automated prompts improves the automation and intelligence of prompting [132]. Applying automated algorithms to prompt engineering, utilizing data-driven methods to generate more intelligent and efficient prompts, can enhance the efficiency and accuracy of NLP tasks in the medical field [133]. With the emergence of multi-modal information in the medical field, combining multiple modalities such as text, images, and speech can better address practical problems in the medical domain [134]. Additionally, conducting multi-task research by integrating multiple medical research tasks can provide better services for clinical healthcare [135], [136]. In conclusion, prompt engineering is a promising area for medical NLP research, and future studies should focus on developing more intelligent and automated prompt engineering methods and exploring the combination of multiple modalities and tasks.

VI. CONCLUSION

In this section, we will provide a summary of the review and discuss its limitations and contributions.

A. Summary of the Review

Overall, this review article provides a comprehensive overview of the different prompt engineering methods and challenges in the context of NLP tasks in the medical field. We have presented different prompt design methods, including manual and automated, discrete and continuous, and provided examples to illustrate their practical applications in medical settings. Additionally, we have discussed the different prompt engineering methods used for various medical tasks, such as classification, generation, detection, augmentation, reconstruction, question answering, prediction, and inference tasks.

One key finding of this review is that prompt engineering is a promising approach to improving the performance of NLP tasks in the medical field. By carefully designing prompts that are tailored to specific tasks and domains, researchers can achieve significant improvements in accuracy and efficiency. However, there are also several challenges and limitations to prompt engineering that need to be addressed in future research. Moreover, the effectiveness of prompts may depend on the specific characteristics of the task and the domain, which can vary widely across different medical applications.

B. Limitations of the Review

Despite the comprehensive literature search process and the inclusion of 333 relevant studies, it is possible that some relevant research in the field of medical NLP prompt engineering may have been missed. Therefore, the scope of this review may not be entirely exhaustive, and there could be other important studies that are not covered.

Additionally, although the article covers a range of medical tasks and the prompt engineering methods employed in each of them, there may be other tasks or research methodologies that are not explored. Future research in this field should, therefore, continue to investigate the potential of prompt engineering in other areas of medical NLP and explore additional methods that may not have been considered in this review.

C. Conclusion and Contributions

This review article makes a significant contribution to the field by providing a comprehensive and systematic overview of prompt engineering methods for NLP tasks in the medical domain. By covering a wide range of methods, from manual to automated and from discrete to continuous, the article serves as a valuable reference and guide for researchers in the medical NLP field to help them choose and design prompts that can improve model performance and efficiency. Additionally, the article discusses the unique challenges posed by medical NLP tasks and explores the future research directions for prompt engineering in this field. Overall, this review article is a valuable resource for medical NLP researchers and provides insightful guidance for future research in this rapidly growing field.

D. Recommendations for Future Research

In summary, the future of NLP in the medical field looks promising, with more advanced techniques such as prompt engineering and multi-modal integration being explored. We can expect to see more research applying state-of-the-art models like ChatGPT and GPT-4 to medical NLP tasks, as well as developing more efficient and accurate prompt engineering methods [137]. Furthermore, there is a need to explore other areas of NLP in the medical field [74], such as text-based medical image analysis and building medical knowledge graphs.

As the field continues to evolve, the development of large models such as GPT-4 will further advance NLP research. However, this will also require innovation and breakthroughs in both hardware and software to accommodate the higher computing and resource demands.

Overall, future work in this field will require a comprehensive and precise approach that incorporates multi-modal information and large model technology to advance medical NLP research and applications. With the potential for cross-disciplinary collaboration between medical professionals and NLP researchers, the possibilities for improving healthcare through NLP are vast.

VII. ACKNOWLEDGMENT

This work was supported by Faculty Construction Project under Grants No. 22SH0201178 and 23SH0201228; Foundational Research in Specialized Disciplines under Grant No. G2023WD0146; and the National Natural Science Foundation of China under Grants No. 62276050, 62001410, and 61976131.

REFERENCES

[1] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
[2] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, "A prompt pattern catalog to enhance prompt engineering with chatgpt," arXiv preprint arXiv:2302.11382, 2023.
[3] M. Hu, S. Pan, Y. Li, and X. Yang, "Advancing medical imaging with language models: A journey from n-grams to chatgpt," arXiv preprint arXiv:2304.04920, 2023.
[4] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[5] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020.
[6] B. Lester, R. Al-Rfou, and N. Constant, "The power of scale for parameter-efficient prompt tuning," arXiv preprint arXiv:2104.08691, 2021.
[7] J. Kopač, M. Bahor, and M. Soković, "Optimal machining parameters for achieving the desired surface roughness in fine turning of cold pre-formed steel workpieces," International Journal of Machine Tools and Manufacture, vol. 42, no. 6, pp. 707–716, 2002.
[8] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H. Zheng, and M. Sun, "Openprompt: An open-source framework for prompt-learning," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2022, pp. 105–113.
[9] V. Liu and L. B. Chilton, "Design guidelines for prompt engineering text-to-image generative models," in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–23.
[10] T. Sorensen, J. Robinson, C. Rytting, A. Shaw, K. Rogers, A. Delorey, M. Khalil, N. Fulda, and D. Wingate, "An information-theoretic approach to prompt engineering without ground truth labels," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 819–862.
[11] G. Rong, A. Mendez, E. B. Assi, B. Zhao, and M. Sawan, "Artificial intelligence in healthcare: review and prediction case studies," Engineering, vol. 6, no. 3, pp. 291–301, 2020.
[12] L. Zhou and G. Hripcsak, "Temporal reasoning with medical data—a review with emphasis on medical natural language processing," Journal of Biomedical Informatics, vol. 40, no. 2, pp. 183–202, 2007.
[13] R. Baud, A.-M. Rassinoux, and J.-R. Scherrer, "Natural language processing and semantical representation of medical texts," Methods of Information in Medicine, vol. 31, no. 02, pp. 117–125, 1992.
[14] Z. Liu, X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, W. Liu, D. Shen, Q. Li et al., "Deid-gpt: Zero-shot medical text de-identification by gpt-4," arXiv preprint arXiv:2303.11032, 2023.
[15] H. Dai, Z. Liu, W. Liao, X. Huang, Z. Wu, L. Zhao, W. Liu, N. Liu, S. Li, D. Zhu et al., "Chataug: Leveraging chatgpt for text data augmentation," arXiv preprint arXiv:2302.13007, 2023.
[16] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller, "Language models as knowledge bases?" in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2463–2473.
[17] X. Wei, X. Cui, N. Cheng, X. Wang, X. Zhang, S. Huang, P. Xie, J. Xu, Y. Chen, M. Zhang et al., "Zero-shot information extraction via chatting with chatgpt," arXiv preprint arXiv:2302.10205, 2023.
[18] Q. Lyu, J. Tan, M. E. Zapadka, J. Ponnatapuram, C. Niu, G. Wang, and C. T. Whitlow, "Translating radiology reports into plain language using chatgpt and gpt-4 with prompt learning: Promising results, limitations, and potential," arXiv preprint arXiv:2303.09038, 2023.
[19] B. Lamichhane, "Evaluation of chatgpt for nlp-based mental health applications," arXiv preprint arXiv:2303.15727, 2023.
[20] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, "Universal adversarial triggers for attacking and analyzing nlp," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[21] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, "Autoprompt: Eliciting knowledge from language models with automatically generated prompts," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4222–4235.
[22] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597.
[23] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Learning to prompt for vision-language models," International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[25] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[26] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer, "Generating wikipedia by summarizing long sequences," in International Conference on Learning Representations, 2018.
[27] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[28] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[29] A. Najar and M. Chetouani, "Reinforcement learning with human advice: a survey," Frontiers in Robotics and AI, vol. 8, p. 584075, 2021.
[30] W. B. Knox and P. Stone, "Interactively shaping agents via human reinforcement: The tamer framework," in Proceedings of the Fifth International Conference on Knowledge Capture, 2009, pp. 9–16.
[31] L. Cui, Y. Wu, J. Liu, S. Yang, and Y. Zhang, "Template-based named entity recognition using bart," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 1835–1845.
[32] S. Sivarajkumar and Y. Wang, "Healthprompt: A zero-shot learning paradigm for clinical natural language processing," arXiv preprint arXiv:2203.05061, 2022.
[33] L. Milecki, V. Kalogeiton, S. Bodard, D. Anglicheau, J.-M. Correas, M.-O. Timsit, and M. Vakalopoulou, "Medimp: Medical images and prompts for renal transplant representation learning," in MIDL 2023, 2023.
[34] W. Feng, X. Bu, C. Zhang, and X. Li, "Beyond bounding box: Multimodal knowledge learning for object detection," arXiv preprint arXiv:2205.04072, 2022.
[35] C. Yang, A. Woicik, H. Poon, and S. Wang, "Bliam: Literature-based data synthesis for synergistic drug combination prediction," arXiv preprint arXiv:2302.06860, 2023.
[36] S. Liang, M. Zhao, and H. Schütze, "Modular and parameter-efficient multimodal fusion with prompting," arXiv preprint arXiv:2203.08055, 2022.

[37] Z. Gao, Y. Hu, C. Tan, and S. Z. Li, "Prefixmol: Target- and chemistry-aware molecule design via prefix embedding," arXiv preprint arXiv:2302.07120, 2023.
[38] M. Abaho, D. Bollegala, P. Williamson, and S. Dodd, "Position-based prompting for health outcome generation," in Proc. of the 21st BioNLP Workshop associated with the ACL SIGBIOMED special interest group, 2022.
[39] V. D. Lai, N. T. Ngo, A. P. B. Veyseh, H. Man, F. Dernoncourt, T. Bui, and T. H. Nguyen, "Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning," arXiv preprint arXiv:2304.05613, 2023.
[40] J. Holmes, Z. Liu, L. Zhang, Y. Ding, T. T. Sio, L. A. McGee, J. B. Ashman, X. Li, T. Liu, J. Shen et al., "Evaluating large language models on a highly-specialized topic, radiation oncology physics," arXiv preprint arXiv:2304.01938, 2023.
[41] J. Yuan, R. Tang, X. Jiang, and X. Hu, "Llm for patient-trial matching: Privacy-aware data augmentation towards better performance and generalizability," arXiv preprint arXiv:2303.16756, 2023.
[42] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, "Large language models are human-level prompt engineers," arXiv preprint arXiv:2211.01910, 2022.
[43] Y. Wang, J. Deng, T. Wang, B. Zheng, S. Hu, X. Liu, and H. Meng, "Exploiting prompt learning with pre-trained language models for alzheimer's disease detection," arXiv preprint arXiv:2210.16539, 2022.
[44] H. Wang, C. Liu, N. Xi, S. Zhao, M. Ju, S. Zhang, Z. Zhang, Y. Zheng, B. Qin, and T. Liu, "Prompt combines paraphrase: Teaching pre-trained models to understand rare biomedical words," in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 1422–1431.
[45] A. Romanov and C. Shivade, "Lessons from natural language inference in the clinical domain," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1586–1596.
[46] T. Gao, A. Fisch, and D. Chen, "Making pre-trained language models better few-shot learners," in Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021. Association for Computational Linguistics (ACL), 2021, pp. 3816–3830.
[47] T. Chen, L. Liu, X. Jia, B. Cui, H. Tang, and S. Tang, "Distilling task-specific logical rules from large pre-trained models," arXiv preprint arXiv:2210.02768, 2022.
[48] Y. Ruan, X. Lan, D. J. Tan, H. R. Abdullah, and M. Feng, "Medical intervention duration estimation using language-enhanced transformer encoder with medical prompts," arXiv preprint arXiv:2303.17408, 2023.
[49] C. Ma, Z. Wu, J. Wang, S. Xu, Y. Wei, Z. Liu, L. Guo, X. Cai, S. Zhang, T. Zhang et al., "Impressiongpt: An iterative optimizing framework for radiology report summarization with chatgpt," arXiv preprint arXiv:2304.08448, 2023.
[50] A. Elfrink, I. Vagliano, A. Abu-Hanna, and I. Calixto, "Soft-prompt tuning to predict lung cancer using primary care free-text dutch medical notes," arXiv preprint arXiv:2303.15846, 2023.
[51] Y. Wang, W. Huang, C. Li, X. Zheng, Y. Lin, and S. Wang, "Adaptive promptnet for auxiliary glioma diagnosis without contrast-enhanced mri," arXiv preprint arXiv:2211.07966, 2022.
[52] Z. Li, L. Zhao, Z. Zhang, H. Zhang, D. Liu, T. Liu, and D. N. Metaxas, "Steering prototype with prompt-tuning for rehearsal-free continual learning," arXiv preprint arXiv:2303.09447, 2023.
[53] Y. Zhang and D. Z. Chen, "Gpt4mia: Utilizing generative pre-trained transformer (gpt-3) as a plug-and-play transductive model for medical image analysis," arXiv preprint arXiv:2302.08722, 2023.
[54] M. Akrout, B. Gyepesi, P. Holló, A. Poór, B. Kincső, S. Solis, K. Cirone, J. Kawahara, D. Slade, L. Abid et al., "Diffusion-based data augmentation for skin disease classification: Impact across original medical datasets to fully synthetic images," arXiv preprint arXiv:2301.04802, 2023.
[55] W. Zhu, R. Zhou, Y. Yuan, C. Timothy, R. Jain, and J. Luo, "Segprompt: Using segmentation map as a better prompt to finetune deep models for kidney stone classification," arXiv preprint arXiv:2303.08303, 2023.
[56] Z. Chen, S. Diao, B. Wang, G. Li, and X. Wan, "Towards unifying medical vision-and-language pre-training via soft prompts," arXiv preprint arXiv:2302.08958, 2023.
[57] Y. Lin, Z. Zhao, Z. Zhu, L. Wang, K.-T. Cheng, and H. Chen, "Exploring visual prompts for whole slide image classification with multiple instance learning," arXiv preprint arXiv:2303.13122, 2023.
[58] P. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari, "Adapting pretrained vision-language foundational models to medical imaging domains," arXiv preprint arXiv:2210.04133, 2022.
[59] J. Lu, J. Li, B. C. Wallace, Y. He, and G. Pergola, "Napss: Paragraph-level medical text simplification via narrative prompting and sentence-matching summarization," arXiv preprint arXiv:2302.05574, 2023.
[60] S. Wang, Z. Zhao, X. Ouyang, Q. Wang, and D. Shen, "Chatcad: Interactive computer-aided diagnosis on medical image using large language models," arXiv preprint arXiv:2302.07257, 2023.
[61] E. Lehman, V. Lialin, K. Y. Legaspi, A. J. R. Sy, P. T. S. Pile, N. R. I. Alberto, R. R. R. Ragasa, C. V. M. Puyat, I. R. I. Alberto, P. G. I. Alfonso et al., "Learning to ask like a physician," arXiv preprint arXiv:2206.02696, 2022.
[62] T. Weber, M. Ingrisch, B. Bischl, and D. Rügamer, "Cascaded latent diffusion models for high-resolution chest x-ray synthesis," arXiv preprint arXiv:2303.11224, 2023.
[63] S. Lee, D. Y. Lee, S. Im, N. H. Kim, and S.-M. Park, "Clinical decision transformer: Intended treatment recommendation through goal prompting," arXiv preprint arXiv:2302.00612, 2023.
[64] Z. Qin, H. Yi, Q. Lao, and K. Li, "Medical image understanding with pretrained vision language models: A comprehensive study," arXiv preprint arXiv:2209.15517, 2022.
[65] Y. Ye, Y. Xie, J. Zhang, Z. Chen, and Y. Xia, "Uniseg: A prompt-driven universal segmentation model as well as a strong representation learner," arXiv preprint arXiv:2304.03493, 2023.
[66] B. Li, X. Wang, X. Xu, Y. Hou, Y. Feng, F. Wang, and W. Che, "Semantic-guided image augmentation with pre-trained models," arXiv preprint arXiv:2302.02070, 2023.
[67] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., "Large language models encode clinical knowledge," arXiv preprint arXiv:2212.13138, 2022.
[68] T. van Sonsbeek, M. M. Derakhshani, I. Najdenkoska, C. G. Snoek, and M. Worring, "Open-ended medical visual question answering through prefix tuning of language models," arXiv preprint arXiv:2303.05977, 2023.
[69] B. Li, Y. Weng, B. Sun, and S. Li, "Towards visual-prompt temporal answering grounding in medical instructional video," arXiv preprint arXiv:2203.06667, 2022.
[70] J. Kasai, Y. Kasai, K. Sakaguchi, Y. Yamada, and D. Radev, "Evaluating gpt-4 and chatgpt on japanese medical licensing examinations," arXiv preprint arXiv:2303.18027, 2023.
[71] D. Jang and C.-E. Kim, "Exploring the potential of large language models in traditional korean medicine: A foundation model approach to culturally-adapted healthcare," arXiv preprint arXiv:2303.17807, 2023.
[72] G. Zuccon and B. Koopman, "Dr chatgpt, tell me what i want to hear: How prompt knowledge impacts health answer correctness," arXiv preprint arXiv:2302.13793, 2023.
[73] H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, "Capabilities of gpt-4 on medical challenge problems," arXiv preprint arXiv:2303.13375, 2023.
[74] Z. Wu, L. Zhang, C. Cao, X. Yu, H. Dai, C. Ma, Z. Liu, L. Zhao, G. Li, W. Liu et al., "Exploring the trade-offs: Unified large language models vs local fine-tuned models for highly-specific radiology nli task," arXiv preprint arXiv:2304.09138, 2023.
[75] S. Long, T. Schuster, and A. Piché, "Can large language models build causal graphs?" arXiv preprint arXiv:2303.05279, 2023.
[76] Z. Yang, S. Kwon, Z. Yao, and H. Yu, "Multi-label few-shot icd coding as autoregressive generation with prompt," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 4, 2023, pp. 5366–5374.
[77] K. Kolluru, G. Stanovsky et al., "'Covid vaccine is against covid but oxford vaccine is made at oxford!' Semantic interpretation of proper noun compounds," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10407–10420.
[78] C. Niu and G. Wang, "Ct multi-task learning with a large image-text (lit) model," bioRxiv, pp. 2023–04, 2023.
[79] Y. Yang, W. Lei, P. Huang, J. Cao, J. Li, and T.-S. Chua, "A dual prompt learning framework for few-shot dialogue state tracking," in Proceedings of the ACM Web Conference 2023, 2023, pp. 1468–1477.
[80] S. Min, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, "Noisy channel language model prompting for few-shot text classification," arXiv preprint arXiv:2108.04106, 2021.
[81] J. DeYoung, E. Lehman, B. Nye, I. Marshall, and B. C. Wallace, "Evidence inference 2.0: More data, better models," in Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, 2020, pp. 123–132.

[82] N. Taylor, Y. Zhang, D. Joyce, A. Nevado-Holgado, and A. Kormilitzin, "Clinical prompt learning with frozen language models," arXiv preprint arXiv:2205.05535, 2022.
[83] Z. Yao, Y. Cao, Z. Yang, V. Deshpande, and H. Yu, "Extracting biomedical factual knowledge using pretrained language model and electronic health record context," arXiv preprint arXiv:2209.07859, 2022.
[84] M. Keicher, K. Mullakaeva, T. Czempiel, K. Mach, A. Khakzar, and N. Navab, "Few-shot structured radiology report generation using natural language prompts," arXiv preprint arXiv:2203.15723, 2022.
[85] C. Diao, K. Zhou, X. Huang, and X. Hu, "Molcpt: Molecule continuous prompt tuning to generalize molecular representation learning," arXiv preprint arXiv:2212.10614, 2022.
[86] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, "Denseclip: Language-guided dense prediction with context-aware prompting," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022, pp. 18061–18070.
[87] Z. Yao, Y. Cao, Z. Yang, and H. Yu, "Context variance evaluation of pretrained language models for prompt-based biomedical knowledge probing," AMIA Summits on Translational Science Proceedings, vol. 2023, p. 592, 2023.
[88] M. Sung, J. Lee, S. Yi, M. Jeon, S. Kim, and J. Kang, "Can language models be biomedical knowledge bases?" in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 4723–4734.
[89] Y. Yang, P. Huang, J. Cao, J. Li, Y. Lin, J. S. Dong, F. Ma, and J. Zhang, "A prompting-based approach for adversarial example generation and robustness enhancement," arXiv preprint arXiv:2203.10714, 2022.
[90] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, "Prompt-to-prompt image editing with cross-attention control," in The Eleventh International Conference on Learning Representations, 2022.
[91] X. Liu, H.-Y. Huang, G. Shi, and B. Wang, "Dynamic prefix-tuning for generative template-based event extraction," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5216–5228.
[92] Y.-N. Chuang, R. Tang, X. Jiang, and X. Hu, "Spec: A soft prompt-based calibration on mitigating performance variability in clinical notes summarization," arXiv preprint arXiv:2303.13035, 2023.
[93] X. Wang, B. Liu, S. Tang, and L. Wu, "Qrelscore: Better evaluating generated questions with deeper understanding of context-aware relevance," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 562–581.
[94] Y. Du, S. Oraby, V. Perera, M. Shen, A. Narayan-Chen, T. Chung, A. Venkatesh, and D. Hakkani-Tur, "Schema-guided natural language generation," in Proceedings of the 13th International Conference on Natural Language Generation, 2020, pp. 283–295.
[95] A. Yan, J. Li, W. Zhu, Y. Lu, W. Y. Wang, and J. McAuley, "Clip also understands text: Prompting clip for phrase understanding," arXiv preprint arXiv:2210.05836, 2022.
[96] N. T. Le, F. Bai, and A. Ritter, "Few-shot anaphora resolution in scientific protocols via mixtures of in-context experts," in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 2693–2706.
[97] C. Peng, X. Yang, Z. Yu, J. Bian, W. R. Hogan, and Y. Wu, "Clinical concept and relation extraction using prompt-based machine reading comprehension," Journal of the American Medical Informatics Association: JAMIA, p. ocad107.
[98] J. Zhou, Q. Zhang, Q. Chen, L. He, and X.-J. Huang, "A multi-format transfer learning model for event argument extraction via variational information bottleneck," in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 1990–2000.
[99] Z. Yao, J. Tsai, W. Liu, D. A. Levy, E. Druhl, J. I. Reisman, and H. Yu, "Automated identification of eviction status from electronic health record notes," Journal of the American Medical Informatics Association, p. ocad081, 2023.
[100] Y. Xing, Q. Wu, D. Cheng, S. Zhang, G. Liang, P. Wang, and Y. Zhang, "Dual modality prompt tuning for vision-language pre-trained model," IEEE Transactions on Multimedia, 2023.
[101] Z. Ding, A. Wang, H. Chen, Q. Zhang, P. Liu, Y. Bao, W. Yan, and J. Han, "Exploring structured semantic prior for multi label recognition with incomplete labels," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3398–3407.
[102] T.-H. Lo, S.-Y. Weng, H.-J. Chang, and B. Chen, "An effective end-to-end modeling approach for mispronunciation detection," arXiv preprint arXiv:2005.08440, 2020.
[103] J. Yang, H. Jiang, Q. Yin, D. Zhang, B. Yin, and D. Yang, "Seqzero: Few-shot compositional semantic parsing with sequential prompts and zero-shot models," arXiv preprint arXiv:2205.07381, 2022.
[104] Y. Liu, W. Wei, D. Peng, and F. Zhu, "Declaration-based prompt tuning for visual question answering," arXiv preprint arXiv:2205.02456, 2022.
[105] A. T. Liu, W. Xiao, H. Zhu, D. Zhang, S.-W. Li, and A. Arnold, "Qaner: Prompting question answering models for few-shot named entity recognition," arXiv preprint arXiv:2203.01543, 2022.
[106] V. Liévin, C. E. Hother, and O. Winther, "Can large language models reason about medical questions?" arXiv preprint arXiv:2207.08143, 2022.
[107] Y.-H. Kim, S. Kim, M. Chang, and S.-W. Lee, "Leveraging pre-trained language models to streamline natural language interaction for self-tracking," arXiv preprint arXiv:2205.15503, 2022.
[108] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen, "On the advance of making language models better reasoners," arXiv preprint arXiv:2206.02336, 2022.
[109] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, "Pal: Program-aided language models," arXiv preprint arXiv:2211.10435, 2022.
[110] A. Azaria, A. Ekblaw, T. Vieira, and A. Lippman, "Medrec: Using blockchain for medical data access and permission management," in 2016 2nd International Conference on Open and Big Data (OBD). IEEE, 2016, pp. 25–30.
[111] Y. He, S. Zheng, Y. Tay, J. Gupta, Y. Du, V. Aribandi, Z. Zhao, Y. Li, Z. Chen, D. Metzler et al., "Hyperprompt: Prompt-based task-conditioning of transformers," in International Conference on Machine Learning. PMLR, 2022, pp. 8678–8690.
[112] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau et al., "An introduction to deep reinforcement learning," Foundations and Trends® in Machine Learning, vol. 11, no. 3-4, pp. 219–354, 2018.
[113] E. Crothers, N. Japkowicz, and H. Viktor, "Machine generated text: A comprehensive survey of threat models and detection methods," arXiv preprint arXiv:2210.07321, 2022.
[114] Z. Chen, Q. Zhou, Y. Shen, Y. Hong, H. Zhang, and C. Gan, "See, think, confirm: Interactive prompting between vision and language models for knowledge-based visual reasoning," arXiv preprint arXiv:2301.05226, 2023.
[115] P. Wang, H. Gao, X. Guo, C. Xiao, F. Qi, and Z. Yan, "An experimental investigation of text-based captcha attacks and their robustness," ACM Computing Surveys, vol. 55, no. 9, pp. 1–38, 2023.
[116] F. Perez and I. Ribeiro, "Ignore previous prompt: Attack techniques for language models," in NeurIPS ML Safety Workshop, 2022.
[117] K. McGuffie and A. Newhouse, "The radicalization risks of gpt-3 and advanced neural language models," arXiv preprint arXiv:2009.06807, 2020.
[118] P. Rajpurkar, E. Chen, O. Banerjee, and E. J. Topol, "Ai in health and medicine," Nature Medicine, vol. 28, no. 1, pp. 31–38, 2022.
[119] X. Liu, J. Zhao, J. Li, B. Cao, and Z. Lv, "Federated neural architecture search for medical data security," IEEE Transactions on Industrial Informatics, vol. 18, no. 8, pp. 5628–5636, 2022.
[120] K. Wang, B. Zhan, C. Zu, X. Wu, J. Zhou, L. Zhou, and Y. Wang, "Semi-supervised medical image segmentation via a tripled-uncertainty guided mean teacher model with contrastive learning," Medical Image Analysis, vol. 79, p. 102447, 2022.
[121] A. M. Barragan-Montero, A. Bibal, M. Huet, C. Draguet, G. Valdes, D. Nguyen, S. Willems, L. Vandewinckele, M. Holmstrom, F. Lofman et al., "Towards a safe and efficient clinical implementation of machine learning in radiation oncology by exploring model interpretability, explainability and data-model dependency," Physics in Medicine & Biology, 2022.
[122] X. Li, H. Xiong, X. Li, X. Wu, X. Zhang, J. Liu, J. Bian, and D. Dou, "Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond," Knowledge and Information Systems, vol. 64, no. 12, pp. 3197–3234, 2022.
[123] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," arXiv preprint arXiv:2203.11171, 2022.
[124] R. U. Rasool, H. F. Ahmad, W. Rafique, A. Qayyum, and J. Qadir, "Security and privacy of internet of medical things: A contemporary review in the age of surveillance, botnets, and adversarial ml," Journal of Network and Computer Applications, p. 103332, 2022.
[125] T. Zhong, Y. Wei, L. Yang, Z. Wu, Z. Liu, X. Wei, W. Li, J. Yao, C. Ma, X. Li, D. Zhu, X. Jiang, J.-F. Han, D. Shen, T. Liu, and T. Zhang, "Chatabl: Abductive learning via natural language interaction with chatgpt," arXiv preprint arXiv:2304.11107, 2023.

[126] B. R. Andrus, Y. Nasiri, S. Cui, B. Cullen, and N. Fulda, "Enhanced story comprehension for large language models through dynamic document-based knowledge graphs," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 10436–10444.
[127] A. Gilson, C. W. Safranek, T. Huang, V. Socrates, L. Chi, R. A. Taylor, D. Chartash et al., "How does chatgpt perform on the united states medical licensing examination? the implications of large language models for medical education and knowledge assessment," JMIR Medical Education, vol. 9, no. 1, p. e45312, 2023.
[128] F. Behrad and M. S. Abadeh, "An overview of deep learning methods for multimodal medical data mining," Expert Systems with Applications, p. 117006, 2022.
[129] B. Jacob, A. Kaushik, P. Velavan, and M. Sharma, "Autonomous drones for medical assistance using reinforcement learning," Advances in Augmented Reality and Virtual Reality, pp. 133–156, 2022.
[130] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu et al., "Summary of chatgpt/gpt-4 research and perspective towards the future of large language models," arXiv preprint arXiv:2304.01852, 2023.
[131] J. S. Chen and S. L. Baxter, "Applications of natural language processing in ophthalmology: present and future," Frontiers in Medicine, vol. 9, 2022.
[132] M. Thakur, S. Dhanalakshmi, H. Kuresan, R. Senthil, R. Narayanamoorthi, and K. W. Lai, "Automated restricted boltzmann machine classifier for early diagnosis of parkinson's disease using digitized spiral drawings," Journal of Ambient Intelligence and Humanized Computing, vol. 14, no. 1, pp. 175–189, 2023.
[133] E. A. Martin, A. G. D'Souza, S. Lee, C. Doktorchik, C. A. Eastwood, and H. Quan, "Hypertension identification using inpatient clinical notes from electronic medical records: an explainable, data-driven algorithm study," Canadian Medical Association Open Access Journal, vol. 11, no. 1, pp. E131–E139, 2023.
[134] M. A. Azam, K. B. Khan, S. Salahuddin, E. Rehman, S. A. Khan, M. A. Khan, S. Kadry, and A. H. Gandomi, "A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics," Computers in Biology and Medicine, vol. 144, p. 105253, 2022.
[135] R. Chengoden, N. Victor, T. Huynh-The, G. Yenduri, R. H. Jhaveri, M. Alazab, S. Bhattacharya, P. Hegde, P. K. R. Maddikunta, and T. R. Gadekallu, "Metaverse for healthcare: A survey on potential applications, challenges and future directions," IEEE Access, 2023.
[136] Q. D. Buchlak, N. Esmaili, C. Bennett, and F. Farrokhi, "Natural language processing applications in the clinical neurosciences: A machine learning augmented systematic review," Machine Learning in Clinical Neuroscience: Foundations and Applications, pp. 277–289, 2022.
[137] B. Kutela, K. Msechu, S. Das, and E. Kidando, "Chatgpt's scientific writings: A case study on traffic safety," Available at SSRN 4329120, 2023.
