Large Language Models Encode Clinical Knowledge
https://doi.org/10.1038/s41586-023-06291-2

Received: 25 January 2023
Accepted: 5 June 2023
Published online: 12 July 2023
Open access

Karan Singhal1,4 ✉, Shekoofeh Azizi1,4 ✉, Tao Tu1,4, S. Sara Mahdavi1, Jason Wei1, Hyung Won Chung1, Nathan Scales1, Ajay Tanwani1, Heather Cole-Lewis1, Stephen Pfohl1, Perry Payne1, Martin Seneviratne1, Paul Gamble1, Chris Kelly1, Abubakr Babiker1, Nathanael Schärli1, Aakanksha Chowdhery1, Philip Mansfield1, Dina Demner-Fushman2, Blaise Agüera y Arcas1, Dale Webster1, Greg S. Corrado1, Yossi Matias1, Katherine Chou1, Juraj Gottweis1, Nenad Tomasev3, Yun Liu1, Alvin Rajkomar1, Joelle Barral1, Christopher Semturs1, Alan Karthikesalingam1,5 ✉ & Vivek Natarajan1,5 ✉
Large language models (LLMs) have demonstrated impressive capabilities, but the
bar for clinical applications is high. Attempts to assess the clinical knowledge of
models typically rely on automated evaluations based on limited benchmarks. Here,
to address these limitations, we present MultiMedQA, a benchmark combining six
existing medical question answering datasets spanning professional medicine,
research and consumer queries and a new dataset of medical questions searched
online, HealthSearchQA. We propose a human evaluation framework for model
answers along multiple axes including factuality, comprehension, reasoning, possible
harm and bias. In addition, we evaluate Pathways Language Model1 (PaLM, a 540-billion
parameter LLM) and its instruction-tuned variant, Flan-PaLM2, on MultiMedQA. Using
a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy
on every MultiMedQA multiple-choice dataset (MedQA3, MedMCQA4, PubMedQA5
and Measuring Massive Multitask Language Understanding (MMLU) clinical topics6),
including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions),
surpassing the prior state of the art by more than 17%. However, human evaluation
reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-
efficient approach for aligning LLMs to new domains using a few exemplars. The
resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians.
We show that comprehension, knowledge recall and reasoning improve with model
scale and instruction prompt tuning, suggesting the potential utility of LLMs in
medicine. Our human evaluations reveal limitations of today’s models, reinforcing
the importance of both evaluation frameworks and method development in creating
safe, helpful LLMs for clinical applications.
Medicine is a humane endeavour in which language enables key interactions for and between clinicians, researchers and patients. Yet, today's artificial intelligence (AI) models for applications in medicine and healthcare have largely failed to fully utilize language. These models, although useful, are predominantly single-task systems (for example, for classification, regression or segmentation) lacking expressivity and interactive capabilities7–9. As a result, there is a discordance between what today's models can do and what may be expected of them in real-world clinical workflows10.

Recent advances in LLMs offer an opportunity to rethink AI systems, with language as a tool for mediating human–AI interaction. LLMs are 'foundation models'11, large pre-trained AI systems that can be repurposed with minimal effort across numerous domains and diverse tasks. These expressive and interactive models offer great promise in their ability to learn generally useful representations from the knowledge encoded in medical corpora, at scale. There are several exciting potential applications of such models in medicine, including knowledge retrieval, clinical decision support, summarization of key findings, triaging patients, addressing primary care concerns and more.

However, the safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and capture and mitigate potential harms. This is especially important for LLMs, since these models may produce text generations (hereafter referred to as 'generations') that are misaligned with clinical and societal values. They may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities.
1Google Research, Mountain View, CA, USA. 2National Library of Medicine, Bethesda, MD, USA. 3DeepMind, London, UK. 4These authors contributed equally: Karan Singhal, Shekoofeh Azizi, Tao Tu. 5These authors jointly supervised this work: Alan Karthikesalingam, Vivek Natarajan. ✉e-mail: karansinghal@google.com; shekazizi@google.com; alankarthi@google.com; natviv@google.com
[Fig. 1 graphic: left, the MultiMedQA benchmark (HealthSearchQA, LiveQA TREC 2017, MedicationQA, MedQA (USMLE), MedMCQA, PubMedQA and MMLU) together with prompting, instruction tuning and instruction prompt tuning; right, MedQA (USMLE) accuracy of prior models (GPT-Neo, PubMedBERT, BioLinkBERT, DRAGON, PubMed GPT) rising from 33.3% to 50.3% between 2020 and 2022, with Flan-PaLM (540B) reaching 67.6%, a gain of more than 17%; bottom, an example consumer medical question with the Med-PaLM answer below.]

Q: How long does it take for newborn jaundice to go away?
Med-PaLM: Newborn jaundice is when a newborn baby's skin and eyes look yellow. It is
caused by a buildup of a substance called bilirubin in the baby's blood. Bilirubin is a
byproduct of the normal process of red blood cells breaking down. Newborn jaundice is
common and typically harmless, but it can be a sign of a more serious condition. The
jaundice typically goes away on its own within a few weeks. In some cases, it may last
longer or require treatment. The time it takes for the jaundice to go away can vary
depending on the cause and the severity of the jaundice. If the jaundice is severe or
lasts longer than a few weeks, the doctor may recommend testing or treatment to
determine the cause and prevent complications.
Fig. 1 | Overview of our contributions. We curate MultiMedQA, a benchmark for answering medical questions spanning medical exam, medical research and consumer medical questions. We evaluate PaLM and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM exceeds state-of-the-art performance on MedQA (US Medical Licensing Examination (USMLE)), MedMCQA, PubMedQA and MMLU clinical topics. In particular, it improves over the previous state of the art on MedQA (USMLE) by over 17%. We next propose instruction prompt tuning to further align Flan-PaLM to the medical domain, producing Med-PaLM. Med-PaLM's answers to consumer medical questions compare favourably with answers given by clinicians under our human evaluation framework, demonstrating the effectiveness of instruction prompt tuning.
To evaluate how well LLMs encode clinical knowledge and assess their potential in medicine, we consider the answering of medical questions. This task is challenging: providing high-quality answers to medical questions requires comprehension of medical context, recall of appropriate medical knowledge, and reasoning with expert information. Existing medical question-answering benchmarks3 are often limited to assessing classification accuracy or automated natural language generation metrics (for example, BLEU12) and do not enable the detailed analysis required for real-world clinical applications. This creates an unmet need for a broad medical question-answering benchmark to assess LLMs for their response factuality, use of expert knowledge in reasoning, helpfulness, precision, health equity and potential harm.

To address this, we curate MultiMedQA, a benchmark comprising seven medical question-answering datasets, including six existing datasets: MedQA3, MedMCQA4, PubMedQA5, LiveQA13, MedicationQA14 and MMLU clinical topics6. We introduce a seventh dataset, HealthSearchQA, which consists of commonly searched health questions.

To assess LLMs using MultiMedQA, we build on PaLM, a 540-billion parameter (540B) LLM1, and its instruction-tuned variant Flan-PaLM2. Using a combination of few-shot15, chain-of-thought16 (COT) and self-consistency17 prompting strategies, Flan-PaLM achieves state-of-the-art performance on MedQA, MedMCQA, PubMedQA and MMLU clinical topics, often outperforming several strong LLM baselines by a substantial margin. On the MedQA dataset comprising USMLE-style questions, Flan-PaLM exceeds the previous state of the art by more than 17%.

Despite the strong performance of Flan-PaLM on multiple-choice questions, its answers to consumer medical questions reveal key gaps. To resolve this, we propose instruction prompt tuning, a data- and parameter-efficient alignment technique, to further adapt Flan-PaLM to the medical domain. The resulting model, Med-PaLM, performs encouragingly on the axes of our pilot human evaluation framework. For example, a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with scientific consensus, compared with 92.6% for Med-PaLM answers, on par with clinician-generated answers (92.9%). Similarly, 29.7% of Flan-PaLM answers were rated as potentially leading to harmful outcomes, in contrast to 5.9% for Med-PaLM, which was similar to the result for clinician-generated answers (5.7%).

Although these results are promising, the medical domain is complex. Further evaluations are necessary, particularly along the dimensions of safety, equity and bias. Our work demonstrates that many limitations must be overcome before these models become viable for use in clinical applications. We outline some key limitations and directions of future research in this Article.

Key contributions
Our first key contribution is an approach for evaluation of LLMs in the context of medical question answering. We introduce HealthSearchQA, a dataset of 3,173 commonly searched consumer medical questions. We present this dataset alongside six existing open datasets for answering medical questions spanning medical exam, medical research and consumer medical questions, as a diverse benchmark to assess the clinical knowledge and question-answering capabilities of LLMs (see Methods, 'Datasets').

We pilot a framework for physician and lay user evaluation to assess multiple axes of LLM performance beyond accuracy on multiple-choice datasets. Our evaluation assesses answers for agreement with the scientific and clinical consensus, the likelihood and possible extent of harm, reading comprehension, recall of relevant clinical knowledge, manipulation of knowledge via valid reasoning, completeness of responses, potential for bias, relevance and helpfulness (see Methods, 'Framework for human evaluation').

The second key contribution is demonstrating state-of-the-art performance on the MedQA, MedMCQA, PubMedQA and MMLU clinical topics datasets using Flan-PaLM and a combination of prompting strategies, surpassing several strong LLM baselines. Specifically, we reach 67.6% accuracy on MedQA (more than 17% above the previous state of the art), 57.6% on MedMCQA and 79.0% on PubMedQA.

The next contribution is the introduction of instruction prompt tuning, a simple, data- and parameter-efficient technique for aligning LLMs to the safety-critical medical domain (see Methods, 'Modelling'). We leverage this technique to build Med-PaLM, an instruction prompt-tuned version of Flan-PaLM specialized for the medical domain.
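As a concrete illustration of the few-shot, chain-of-thought and self-consistency prompting strategies referenced above, the sketch below samples several chain-of-thought completions for a multiple-choice question and takes a majority vote over the extracted answer letters. This is a minimal, generic sketch in Python rather than the paper's implementation; the generate function is a hypothetical stand-in for an LLM sampling call, and the prompt template and answer-extraction pattern are illustrative assumptions.

# Minimal sketch of chain-of-thought prompting with self-consistency for a
# multiple-choice question. `generate` is a hypothetical stand-in for an LLM
# sampling call; it is not the interface used in the paper.
import re
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical LLM call returning one sampled completion."""
    raise NotImplementedError("plug in a model API here")

COT_TEMPLATE = (
    "Question: {question}\n"
    "{options}\n"
    "Explanation: let's think step by step."
)

def self_consistency_answer(question: str, options: str, n_samples: int = 11) -> str:
    """Sample several reasoning paths and majority-vote the final answer letter."""
    votes = []
    for _ in range(n_samples):
        completion = generate(COT_TEMPLATE.format(question=question, options=options))
        match = re.search(r"answer is\s*\(?([A-E])\)?", completion, re.IGNORECASE)
        if match:
            votes.append(match.group(1).upper())
    return Counter(votes).most_common(1)[0][0] if votes else "abstain"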
[Fig. 2 graphic: accuracy (%) bar chart on MedMCQA, MedQA (USMLE) and PubMedQA; Flan-PaLM 540B reaches 57.6% and 67.6% on the first two, against previous bests of 52.9% and 50.3%.]

Fig. 2 | Comparison of our method and prior state of the art. Our Flan-PaLM 540B model exceeds the previous state-of-the-art performance (SOTA) on MedQA (four options), MedMCQA and PubMedQA datasets. The previous state-of-the-art results are from Galactica20 (MedMCQA), PubMedGPT19 (MedQA) and BioGPT21 (PubMedQA). The percentage accuracy is shown above each column.

The MMLU clinical topics include anatomy, clinical knowledge, professional medicine, human genetics, college medicine and college biology. Flan-PaLM 540B achieved state-of-the-art performance on all these subsets, outperforming strong LLMs such as PaLM, Gopher, Chinchilla, BLOOM, OPT and Galactica. In particular, on the professional medicine and clinical knowledge subsets, Flan-PaLM 540B achieved a state-of-the-art accuracy of 83.8% and 80.4%, respectively. Extended Data Fig. 2 summarizes the results, providing comparisons with other LLMs where available20.

Ablations
We performed several ablations on three of the multiple-choice datasets—MedQA, MedMCQA and PubMedQA—to better understand our results and identify the key components contributing to Flan-PaLM's performance.
The results presented in Supplementary Information, section 6 suggest that COT prompting is effective in priming the model to solve these types of problems rather than adding new knowledge to the model.
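To make COT prompting concrete, the snippet below shows roughly what a few-shot chain-of-thought exemplar for a USMLE-style item could look like. The exemplar question and reasoning are invented for illustration and are not drawn from MedQA or from the paper's actual prompts.

# Illustrative few-shot chain-of-thought prompt (Python string) for a
# USMLE-style question. The exemplar is invented; the paper's prompts differ.
FEW_SHOT_COT_PROMPT = """\
Question: A long-term smoker has had a productive cough for 3 months in
each of the past 2 years. Which diagnosis is most likely?
(A) Asthma (B) Chronic bronchitis (C) Pneumonia (D) Pulmonary embolism
Explanation: A productive cough lasting at least 3 months in 2 consecutive
years is the clinical definition of chronic bronchitis. The answer is (B).

Question: {new_question}
{new_options}
Explanation:"""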
[Fig. 4 graphic, panels b, d, e and f: proportion of answers rated as containing inappropriate and/or incorrect content (Flan-PaLM 16.1%), extent of possible harm (Flan-PaLM 29.7%), likelihood of possible harm (Flan-PaLM 19.4%) and possibility of bias (Flan-PaLM 7.9%, Med-PaLM 0.8%, clinician 1.4%).]
Fig. 4 | Clinician evaluation of answers. a–f, Clinicians were asked to rate answers to questions in the HealthSearchQA, LiveQA and MedicationQA datasets for agreement with scientific and clinical consensus (a), the presence of incorrect content (b), the omission of content (c), the extent of possible harm (d), the likelihood of harm (e) and possible bias in answers (f). We compare answers from Flan-PaLM, Med-PaLM and clinicians. Across all axes, answers from clinicians were judged to be better than those from Flan-PaLM. Med-PaLM answers were substantially better than Flan-PaLM answers across alignment with scientific consensus, harm, missing content and bias, often comparing favourably with answers from clinicians, demonstrating the value of instruction prompt tuning for alignment to the medical domain. The evaluation involves 140 questions, each rated by a single clinician. We used the non-parametric bootstrap to estimate any significant variation in the results, with 1,000 bootstrap replicas used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations. Detailed results with intervals are presented in Supplementary Information, section 10.
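The non-parametric bootstrap procedure described in this caption (and in the Fig. 5 caption below) can be sketched as follows. This is a generic illustration of a 95% percentile interval over per-question ratings, written with standard NumPy calls; it is not the authors' analysis code.

# Sketch of a non-parametric bootstrap 95% percentile interval for the
# proportion of answers receiving a given rating (for example, "agrees with
# scientific consensus"). Generic illustration, not the paper's code.
import numpy as np

def bootstrap_percentile_interval(ratings, n_replicas=1000, alpha=0.05, seed=0):
    """ratings: array of 0/1 per-question ratings, one rating per question."""
    ratings = np.asarray(ratings)
    rng = np.random.default_rng(seed)
    replicas = np.array([
        rng.choice(ratings, size=ratings.size, replace=True).mean()
        for _ in range(n_replicas)
    ])
    lower, upper = np.percentile(replicas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return ratings.mean(), (lower, upper)

# Example with 140 simulated ratings:
# point, interval = bootstrap_percentile_interval(np.random.binomial(1, 0.9, 140))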
prompt tuning for Med-PaLM (Fig. 5). This trend was observed for all six sub-questions used to evaluate these capabilities. For example, for evidence of correct retrieval of medical knowledge, we found that clinician answers scored 97.8%, whereas Flan-PaLM scored 76.3%. However, the instruction prompt-tuned Med-PaLM model scored 95.4%, reducing the performance gap with clinicians.

Incorrect or missing content. The goal of this evaluation was to understand the completeness and correctness of the generated answers by assessing whether an answer omits any information that it should not omit, or whether the answer contains any content that it should not. Where there was deemed to be missing or omitted content, the rater
[Fig. 5 graphic: a, evidence of correct comprehension (Flan-PaLM 90.5%, Med-PaLM 97.5%, clinician 97.8%); b, evidence of incorrect comprehension (Flan-PaLM 9.2%, Med-PaLM 5.0%, clinician 2.2%).]
Fig. 5 | Evaluation of comprehension, retrieval and reasoning capabilities by clinicians. a,b, Evaluation of correctness (a) and incorrectness (b) of reading comprehension, recall of knowledge and reasoning steps. The results indicate a gap between Flan-PaLM and clinicians, and show that Med-PaLM is able to substantially reduce the gap. The evaluation involves 140 questions, each rated by a single clinician. We used the non-parametric bootstrap to estimate any significant variation in the results, with 1,000 bootstrap replicas used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations.
Acknowledgements This project was an extensive collaboration between many teams at Google Research, with DeepMind involved in an advisory capacity. We thank M. Howell, C. Chen, B. Mustafa, D. Fleet, F. Kibria, G. Turner, S. W. Man, D. Kim, B. Hatfield, L. Lehmann, I. Horn, M. Shiels, S. Shetty, J. Zitting, E. Rappaport, L. Marples, V. Sounderajah, A. Connell, J. Freyberg, C. Hughes, M. Jones-Bell, S. Thomas, M. Ho, R. Wong, S. Prakash, B. Green, E. Dominowska, F. Liu and X. Wang for their valuable insights and feedback during our research. We are also grateful to K. DeSalvo, Z. Ghahramani, J. Manyika and J. Dean for their support during the course of this project.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41586-023-06291-2.
Correspondence and requests for materials should be addressed to Karan Singhal, Shekoofeh Azizi, Alan Karthikesalingam or Vivek Natarajan.
Peer review information Nature thanks Andrew Beam and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Reprints and permissions information is available at http://www.nature.com/reprints.
Extended Data Fig. 1 | Instruction prompt tuning for Med-PaLM. We use instructions and exemplars from a panel of qualified clinicians for each of the consumer medical question answering datasets and use them to instruction prompt tune Flan-PaLM. Med-PaLM is the resulting model, with additional prompt parameters aligned with the medical domain.
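As a rough sketch of the mechanism this caption describes, instruction prompt tuning learns a small set of soft prompt vectors that are prepended to the embedded input while the underlying LLM stays frozen. The toy module below is written against a generic PyTorch-style decoder that accepts inputs_embeds; it is an assumption-laden illustration, not the Med-PaLM implementation (which builds on PaLM/Flan-PaLM).

# Toy illustration of instruction prompt tuning: the only trainable
# parameters are soft prompt embeddings prepended to the input, while the
# base model is frozen. Not the Med-PaLM implementation; `base_lm` is any
# frozen decoder exposing an embedding layer and accepting inputs_embeds.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, base_lm: nn.Module, embed: nn.Embedding, prompt_len: int = 40):
        super().__init__()
        self.base_lm, self.embed = base_lm, embed
        for p in self.base_lm.parameters():
            p.requires_grad = False  # keep the base LLM frozen
        self.soft_prompt = nn.Parameter(
            torch.randn(prompt_len, embed.embedding_dim) * 0.02
        )

    def forward(self, input_ids: torch.LongTensor):
        tokens = self.embed(input_ids)                      # (batch, seq, dim)
        prompt = self.soft_prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tokens], dim=1)  # prepend soft prompt
        return self.base_lm(inputs_embeds=inputs_embeds)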
Extended Data Fig. 2 | Comparison of SOTA LLMs on MMLU clinical topics. Flan-PaLM achieves state-of-the-art performance on MMLU clinical topics.
Extended Data Table 1 | Summary of MultiMedQA describing the format, size, and domain of the datasets in the benchmark
Extended Data Table 2 | Summary of the different axes along which clinicians evaluate the answers in our consumer medical question answering datasets
These include agreement with scientific consensus, possibility and likelihood of harm, evidence of comprehension, reasoning and retrieval ability, presence of inappropriate, incorrect or
missing content, and possibility of bias in the answer. We use a panel of clinicians to evaluate the quality of model and human-generated answers along these axes.
Extended Data Table 3 | Summary of the different axes along which lay users evaluate the model answers in our consumer medical question answering datasets
We use a pool of 5 non-expert lay users to evaluate the quality of model and human-generated answers along these axes.
Extended Data Table 4 | Summary of the best performing models on the MedQA (USMLE) dataset questions with 4 options
Med-PaLM was not trained using any of these datasets. These results suggest that instruction prompt tuning aligns the model to the requirements of consumer medical question answering
without affecting base clinical knowledge.
Extended Data Table 6 | Representative explanations generated by the Flan-PaLM 540B model to support its multiple-choice answers in the MedQA dataset
Extended Data Table 7 | Examples of Med-PaLM responses to questions in the HealthSearchQA dataset
Extended Data Table 8 | Examples of HealthSearchQA questions where the physician answers are considered incomplete, and corresponding Med-PaLM answers
This suggests that LLMs may be a useful complement to physicians in future use cases.
Corresponding author(s): Karan Singhal, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan
Last updated by author(s): April 2023
Reporting Summary
Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.
Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
Y The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
Y A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
Y The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Y Give P values as exact values whenever suitable.
Y For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
Y For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Y Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.
Data analysis We will not be able to open source the large language models (LLMs) used in this study. We have provided comprehensive details regarding our underlying methodology. We have also released related models at https://huggingface.co/google/flan-t5-xl.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and
reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy
The benchmark used in the study, MultiMedQA, comprises six open source datasets and an additional one on consumer medical questions, HealthSearchQA, which we newly introduce. The HealthSearchQA dataset is provided as a supplementary file. MedQA: https://github.com/jind11/MedQA; MedMCQA: https://medmcqa.github.io; PubMedQA: https://pubmedqa.github.io; LiveQA: https://github.com/abachaa/LiveQA_MedicalTask_TREC2017; MedicationQA: https://github.com/abachaa/Medication_QA_MedInfo2019; MMLU: https://huggingface.co/datasets/hendrycks_test.
Recruitment N/A
Note that full information on the approval of the study protocol must also be provided in the manuscript.
Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf
Data exclusions We did not apply any special exclusion criteria to the datasets.
Replication We have repeated our experiments independently three times to confirm the accuracy of the results for the MedQA dataset. The variance was minimal, as detailed in the paper.
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.
Materials & experimental systems Methods