LLMs For Knowledge Graph Construction and Reasoning

Abstract

Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning. We employ eight distinct datasets that encompass aspects including entity, relation and event extraction, link prediction, and question answering. Empirically, our findings suggest that GPT-4 outperforms ChatGPT in the majority of tasks and even surpasses fine-tuned models in certain reasoning and question-answering datasets. Moreover, our investigation extends to the potential generalization ability of LLMs for information extraction, which culminates in the presentation of the Virtual Knowledge Extraction task and the development of the VINE dataset. Drawing on these empirical findings, we further propose AutoKG, a multi-agent-based approach employing LLMs for KG construction and reasoning, which aims to chart the future of this field and offer exciting opportunities for advancement. We anticipate that our research can provide invaluable insights for future undertakings of KG¹.

1 Introduction

Knowledge Graph (KG) is a semantic network comprising entities, concepts, and relations (Cai et al., 2022; Chen et al., 2023; Zhu et al., 2022; Liang et al., 2022), which can catalyse applications across various scenarios such as recommendation systems, search engines, and question-answering systems (Zhang et al., 2021). Typically, KG construction (Ye et al., 2022b) consists of several tasks, including Named Entity Recognition (NER) (Chiu and Nichols, 2016), Relation Extraction (RE) (Zeng et al., 2015; Chen et al., 2022), Event Extraction (EE) (Chen et al., 2015; Deng et al., 2020), and Entity Linking (EL) (Shen et al., 2015). On the other hand, KG reasoning centers on tasks such as link prediction (Rossi et al., 2021). Additionally, KGs can be employed in Question Answering (QA) tasks (Karpukhin et al., 2020; Zhu et al., 2021), by reasoning over relational sub-graphs in relation to the questions.

Earlier, the construction and reasoning of KGs typically relied on supervised learning methodologies. However, in recent years, with the notable advancements of Large Language Models (LLMs), researchers have observed their exceptional ability within the field of natural language processing (NLP). Despite numerous studies on LLMs (Liu et al., 2023; Shakarian et al., 2023; Lai et al., 2023), a systematic exploration of their application in the KG domain is still limited. To address this gap, our work investigates the potential applicability of LLMs, exemplified by ChatGPT and GPT-4 (OpenAI, 2023), in KG construction and reasoning tasks. By comprehending the fundamental capabilities of LLMs, our study further delves into potential future directions in the field.

Specifically, as illustrated in Figure 1, we initially investigate the zero-shot and one-shot performance of LLMs on entity, relation, and event extraction, link prediction, and question answering to assess their potential applications within the KG domain. The empirical findings reveal that, although LLMs exhibit improved performance in KG construction tasks, they still lag behind state-of-the-art (SOTA) models. However, LLMs demonstrate relatively superior performance in reasoning and question answering tasks. This suggests that they are adept at addressing complex problems, comprehending contextual relations, and leveraging the knowledge acquired during pre-training. Consequently, LLMs such as GPT-4 exhibit limited effectiveness as few-shot information extractors, yet demonstrate considerable proficiency as inference assistants. To further investigate the behavior

∗ Equal contribution and shared co-first authorship.
† Corresponding author.
¹ Code and datasets will be available in https://github.com/zjunlp/AutoKG.
[Figure 1 here. Panel 1: Recent Capabilities of GPT-4 for KG Construction and Reasoning — Relation Extraction (RE), Event Detection (ED), Link Prediction (LP), and Question Answering (QA). Panel 2: Discussion — do LLMs have memorized knowledge, or do they truly have generalization ability? Virtual Knowledge Extraction with the VINE dataset probes unseen knowledge.]
Figure 1: The overview of our work. There are three main components: 1) Basic Evaluation: detailing our assessment of large models (text-davinci-003, ChatGPT, and GPT-4) in both zero-shot and one-shot settings, using performance data from fully supervised state-of-the-art models as benchmarks; 2) Virtual Knowledge Extraction: an examination of large models' virtual knowledge capabilities on the constructed VINE dataset; and 3) AutoKG: the proposal of utilizing multiple agents to facilitate the construction and reasoning of KGs.
of LLMs on information extraction tasks, we design a novel task known as "Virtual Knowledge Extraction". This undertaking aims to discern whether the observed improvements in performance stem from the extensive knowledge repository intrinsic to LLMs, or from their potent generalization capabilities facilitated by instruction tuning and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017). Experimental results on a newly constructed dataset, VINE, indicate that LLMs like GPT-4 can rapidly acquire new knowledge from instructions and accomplish relevant extraction tasks effectively.

Based on these empirical findings, we argue that the considerable reliance of LLMs on instructions makes the design of appropriate prompts a laborious and time-consuming endeavor for KG construction and reasoning. To facilitate further research, we introduce the concept of AutoKG, which employs multiple LLM agents for KG construction and reasoning automatically.

In summary, our research has made the following contributions:

• We evaluate LLMs, including GPT-3.5, ChatGPT, and GPT-4, offering an initial understanding of their capabilities by evaluating their zero-shot and one-shot performance on KG construction and reasoning on eight benchmark datasets.

• We design a novel Virtual Knowledge Extraction task and construct the VINE dataset. By evaluating the performance of LLMs on this dataset, we further demonstrate that LLMs such as GPT-4 possess strong generalization abilities.

• We introduce the concept of employing communicative agents for automatic KG construction and reasoning, known as AutoKG. Leveraging the LLMs' knowledge repositories, we enable multiple LLM agents to assist in the process of KG construction and reasoning through iterative dialogues, providing new insights for future research.
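The zero-shot and one-shot settings evaluated above differ only in whether a single worked demonstration precedes the task instruction. As a rough, hypothetical sketch of how such prompts can be assembled (the instruction wording, example, and helper name are ours, not the paper's exact prompts):

```python
def build_prompt(instruction, text, demonstration=None):
    """Assemble a zero-shot or one-shot prompt for an extraction task."""
    parts = [instruction]
    if demonstration is not None:
        # One-shot: a single worked example precedes the actual input.
        parts.append("Example:\n" + demonstration)
    parts.append("Input: " + text + "\nOutput:")
    return "\n\n".join(parts)

instruction = "Extract all (head entity, relation, tail entity) triples from the input."
demo = ("Input: We have implemented a restricted domain parser called Plume.\n"
        "Output: [Plume, HYPONYM-OF, restricted domain parser]")

zero_shot = build_prompt(instruction, "Some sentence.")
one_shot = build_prompt(instruction, "Some sentence.", demonstration=demo)
```

In the one-shot setting the demonstration would typically be a gold input/output pair drawn from the training split of the same dataset.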
2 Related Work

2.1 Large Language Models

LLMs are pre-trained on substantial amounts of textual data and have become a significant component of contemporary NLP research. Recent advancements in NLP have led to the development of highly capable LLMs, such as GPT-3 (Brown et al., 2020), ChatGPT, and GPT-4, which exhibit exceptional performance across a diverse array of NLP tasks, including machine translation, text summarization, and question answering. Concurrently, several previous studies have indicated that LLMs can achieve remarkable results in relevant downstream tasks with minimal or even no demonstration in the prompt (Wei et al., 2022; Bang et al., 2023; Nori et al., 2023; Qiao et al., 2022; Wei et al., 2023b). This provides further evidence of the robustness and generality of LLMs.

2.2 ChatGPT & GPT-4

ChatGPT, an advanced LLM developed by OpenAI, is primarily designed for engaging in human-like conversations. During the fine-tuning process, ChatGPT utilizes RLHF (Christiano et al., 2017), thereby enhancing its alignment with human preferences and values.

As a cutting-edge large language model developed by OpenAI, GPT-4 builds upon the successes of its predecessors, GPT-3 and ChatGPT. Trained on an unparalleled scale of computation and data, it exhibits remarkable generalization, inference, and problem-solving capabilities across diverse domains. In addition, as a large-scale multimodal model, GPT-4 is capable of processing both image and text inputs. In general, the public release of GPT-4 offers fresh insights into the future advancement of LLMs and presents novel opportunities and challenges within the realm of NLP.

With the popularity of LLMs, an increasing number of researchers are exploring the specific emergent capabilities and advantages they possess. Bang et al. (2023) performs an in-depth analysis of ChatGPT on multitask, multilingual, and multimodal aspects. The findings indicate that ChatGPT excels at zero-shot learning across various tasks, even outperforming fine-tuned models in certain cases. However, it faces challenges when generalized to low-resource languages. Furthermore, in terms of multi-modality, compared to more advanced vision-language models, the capabilities of ChatGPT remain fundamental. Moreover, ChatGPT has garnered considerable attention in various other domains, including information extraction (Gao et al., 2023; Wei et al., 2023b), reasoning (Qin et al., 2023), text summarization (Jeblick et al., 2022), question answering (Tan et al., 2023) and machine translation (Jiao et al., 2023), showcasing its versatility and applicability in the broader field of natural language processing.

While there is a growing body of research on ChatGPT, investigations into GPT-4 continue to be relatively limited. Nori et al. (2023) conducts an extensive assessment of GPT-4 on medical competency examinations and benchmark datasets and shows that GPT-4, without any specialized prompt crafting, surpasses the passing score by over 20 points. Kasai et al. (2023) also studies GPT-4's performance on the Japanese national medical licensing examinations. Furthermore, there are some studies on GPT-4 that focus on cognitive psychology (Sifatkaur et al., 2023), academic exams (Nunes et al., 2023), and translation of radiology reports (Lyu et al., 2023).

3 Recent Capabilities of LLMs for KG Construction and Reasoning

Recently, the advent of LLMs has invigorated the field of NLP. With an aim to explore the potential applications of LLMs in the domain of KG, we select representative models, namely ChatGPT and GPT-4. We conduct a comprehensive assessment of their performance across eight disparate datasets in the field of KG construction and reasoning.

3.1 Evaluation Principle

In this study, we conduct a systematic evaluation of LLMs across various KG-related tasks. Firstly, we assess these models' capabilities in zero-shot and one-shot NLP tasks. Our primary objective is to examine their generalization abilities when faced with limited data, as well as their capacity to reason effectively using pre-trained knowledge in the absence of demonstrations. Secondly, drawing on the evaluation results, we present a comprehensive analysis of the factors that contribute to the models' varying performance in distinct tasks. We aim to explore the reasons and potential shortcomings underlying their superior performance in certain tasks. By comparing and summarizing the strengths and limitations of these models, we seek to offer insights that may guide future improvements.
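The construction-task scores reported later (Table 1) are F1 values over extracted structures. A minimal illustration of how triple-level micro-F1 can be computed under exact matching — our own sketch, since the paper does not spell out its matching procedure:

```python
def triple_f1(predicted, gold):
    """Micro F1 over (head, relation, tail) triples under exact matching."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)  # exact-match true positives
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("Plume", "HYPONYM-OF", "restricted domain parser")]
pred = [("Plume", "HYPONYM-OF", "restricted domain parser"),
        ("We", "USED-FOR", "Plume")]
print(round(triple_f1(pred, gold), 3))  # precision 0.5, recall 1.0 -> 0.667
```

Exact matching is a strict choice; looser variants (e.g. normalizing entity mentions before comparison) would change the scores.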
[Figure 2 panels:]

(1) SciERC — input: "We have implemented a restricted domain parser called Plume." Extracted triples: [Plume, HYPONYM-OF, restricted domain parser]; [We, USED-FOR, implementing Plume].

(2) RE-TACRED — input: "The two projects -- a trachoma prevention plan and a cooking oil plan -- are jointly organized by the New York-based Helen Keller International (HKI), the United Nations Children's Fund and the World Health Organization, the spokesman said, adding that the HKI will implement the two programs using funds donated by Taiwan." Extracted triples: [two projects, org:organized_by, Helen Keller International, UNICEF, World Health Organization]; [two projects, org:implemented_by, Helen Keller International]; [two projects, org:funded_by, Taiwan].

(3) DuIE2.0 (Chinese input, translated) — input: "National team career: George Wilcombe was selected for the Honduras national team in 2008, and he participated with the team in the 2009 North and Central America and Caribbean Gold Cup." Extracted triples (translated): [George Wilcombe, National Team Career, Selected for Honduras National Team]; [George Wilcombe, Participated, 2009 North and Central America and Caribbean Gold Cup]; [George Wilcombe, Participated, Honduras national team]; [2008, Selected, George Wilcombe]; [2009 North and Central America and the Caribbean Gold Cup, Participated, George Wilcombe with the team].

Figure 2: Examples of ChatGPT and GPT-4 on the RE datasets. (1) Zero-shot on the SciERC dataset; (2) Zero-shot on the RE-TACRED dataset; (3) One-shot on the DuIE2.0 dataset.
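The bracketed outputs shown in Figure 2 must be parsed back into structured triples before scoring. A simplified, hypothetical parser (it keeps only well-formed three-element triples, so outputs with extra commas, such as the multi-object RE-TACRED triple above, would need special handling):

```python
import re

def parse_triples(response):
    """Parse lines like "[head, relation, tail]" from a model response."""
    triples = []
    for match in re.finditer(r"\[([^\[\]]+)\]", response):
        parts = [p.strip() for p in match.group(1).split(",")]
        if len(parts) == 3:  # keep only well-formed (head, relation, tail) triples
            triples.append(tuple(parts))
    return triples

response = """[Plume, HYPONYM-OF, restricted domain parser]
[We, USED-FOR, implementing Plume]"""
print(parse_triples(response))
```

A more robust pipeline might instead ask the model for JSON output, which removes the ambiguity of commas inside entity names.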
head-to-tail entity extraction compared to ChatGPT. In the RE-TACRED example, the target triple is (Helen Keller International, org:alternate_names, HKI). ChatGPT, however, fails to extract such relations, possibly due to the close proximity of the head and tail entities and the ambiguity of predicates in this instance. In contrast, GPT-4 successfully extracts the "org:alternate_names" relationship between the head and tail entities, completing the triple extraction. This also demonstrates, to a certain extent, GPT-4's enhanced language comprehension (reading) capabilities compared to ChatGPT.

• One-shot Simultaneously, the optimization of text instructions can contribute to the enhancement of GPT-4's performance. The results indicate that introducing a single training sample in a one-shot manner further boosts the triple extraction abilities of both GPT-4 and ChatGPT.

Using the DuIE2.0 dataset as an example, consider the sentence: "George Wilcombe was selected for the Honduras national team in 2008, and he participated in the 2009 North and Central America and Caribbean Gold Cup with the team". The corresponding triple should be (George Wilcombe, Nationality, Honduras). Although this information is not explicitly stated in the text, GPT-4 successfully extracts it. This outcome is not solely attributed to the single training example providing valuable information; it also stems from GPT-4's comprehensive knowledge base. In this case, GPT-4 infers George Wilcombe's nationality from his selection for the national team.

Nevertheless, during the experiments GPT-4 does not perform flawlessly and struggles with some complex sentences. It is important to note that factors such as dataset noise and ambiguity in the types, including intricate sentence contexts, contribute to this outcome. Moreover, to enable a fair and cross-sectional comparison of different models on the datasets, the experiments do not specify the types of extracted entities in the prompt, which could also influence the experimental results. This observation may help elucidate why GPT-4's performance on SciERC and Re-TACRED is inferior to its performance on DuIE2.0.

Fine-Tuned SOTA      69.42   91.4   68.8   53.2
Zero-shot
  text-davinci-003   11.43    9.8   30.0    4.0
  ChatGPT            10.26   15.2   26.5    4.4
  GPT-4              31.03   15.5   34.2    7.2
One-shot
  text-davinci-003   30.63   12.8   25.0    4.8
  ChatGPT            25.86   14.2   34.1    5.3
  GPT-4              41.91   22.5   30.4    9.1

Table 1: KG Construction tasks (F1 score).

Event Extraction We conduct experiments on event detection with 20 random samples from the MAVEN dataset. Moreover, Wang et al. (2022a) is used as the previous fine-tuned SOTA. Concurrently, GPT-4 achieves commendable results even without the provision of demonstrations. Here we use the F1 score as the evaluation metric.

• Zero-shot Table 1 shows that GPT-4 outperforms ChatGPT in the zero-shot manner. Furthermore, on the example sentence beginning "Now an …", ChatGPT outputs only the single result Becoming_a_member, while GPT-4 identifies three event types: Becoming_a_member, Agree_or_refuse_to_act, and Performing. It is worth noting that in this experiment, ChatGPT frequently provides answers with only one event type. In contrast, GPT-4 is more adept at acquiring contextual information.

• One-shot Under the one-shot setting, GPT-4 experienced a slight decline in performance. The inclusion of a single demonstration rectifies ChatGPT's erroneous judgments made under the zero-shot setting, thereby enhancing its performance.

[Figure 6 panels:]

1) Link Prediction — zero-shot prompt: "Predict the tail entity [MASK] from the given (Academy Award for Best Film Editing, award award category category of, [MASK]) by completing the sentence 'what is the category of of Academy Award for Best Film Editing? The answer is'." Output: "The answer is Academy Awards, so the [MASK] is Academy Awards." A second prompt of the same form is built from (Primetime Emmy Award for Outstanding Guest Actress - Comedy Series, award award category category of, [MASK]).

2) Question Answering — "When did the films written by [A Gathering of Old Men] writers release?" Answer: 1983|1987|1984|1993|2009|2010|1995. Answer (GPT-4 only): 1999|1974|1987|1995|1984.

Figure 6: Illustration of AutoKG, which integrates knowledge graph (KG) construction and reasoning by employing GPT-4 and communicative agents based on ChatGPT. The figure omits the specific operational process of the AI assistant and AI user, providing the results directly.
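The AI-assistant/AI-user dialogue summarized in Figure 6 can be caricatured as a turn-taking loop between two role-prompted agents, in the spirit of CAMEL (Li et al., 2023). The sketch below is our own illustration with stubbed replies in place of real LLM calls; `make_agent` and the role prompts are hypothetical, not the paper's implementation:

```python
def make_agent(system_prompt, reply_fn):
    """A minimal role-prompted agent that keeps its own message history."""
    history = [("system", system_prompt)]

    def step(incoming):
        history.append(("user", incoming))
        reply = reply_fn(incoming)  # stand-in for a real LLM chat call
        history.append(("assistant", reply))
        return reply

    return step

# Stubbed replies keep the loop runnable without any API access.
user_agent = make_agent(
    "You are the AI user. Give the assistant one KG-construction instruction per turn.",
    lambda _: "Instruction: extract triples from the next passage.")
assistant_agent = make_agent(
    "You are the AI assistant. Carry out each instruction and report the result.",
    lambda msg: f"Done: {msg}")

message = "Task: build a knowledge graph about film awards."
for _ in range(2):  # iterative dialogue with a fixed number of rounds
    instruction = user_agent(message)
    message = assistant_agent(instruction)
print(message)
```

In a real deployment each `reply_fn` would call a chat model with the agent's system prompt plus its accumulated history, and the loop would terminate on a task-completion signal rather than a fixed round count.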
the domain knowledge of human experts to yield high-quality knowledge graphs. Simultaneously, it could improve the factual accuracy of large language models when dealing with domain-specific tasks, courtesy of this human-machine collaboration. This objective, in turn, is expected to increase the models' practical utility.

AutoKG not only expedites the customization of domain-specific knowledge graphs but also augments the transparency of large-scale models and the interaction of embodied agents. More precisely, AutoKG assists in a deeper comprehension of the internal knowledge structures and operating mechanisms of LLMs, thereby enhancing model transparency. Moreover, AutoKG can be deployed as a cooperative human-machine interaction platform, enabling effective communication and interaction between humans and the model. This interaction promotes a superior understanding and guidance of the model's learning and decision-making processes, thereby enhancing the model's efficiency and accuracy when dealing with intricate tasks.

Despite the notable advancements brought about by our approach, it is not without its limitations. These limitations, however, present themselves as opportunities for further exploration and improvement:

The utilization of the API is constrained by a maximum token limit. Currently, due to the unavailability of the GPT-4 API, the gpt-3.5-turbo in use is subjected to a max token restriction. This constraint impacts the construction of KGs, as tasks may not be executed appropriately if this limit is exceeded.

AutoKG currently exhibits shortcomings in facilitating efficient human-machine interaction. In situations where tasks are wholly autonomously conducted by machines, humans cannot promptly rectify errors that occur during communication. Conversely, involving humans in each step of machine communication can significantly augment time and labor costs. Thus, determining the optimal moment for human intervention is crucial for the efficient and effective construction of knowledge graphs.

Training data for LLMs are time-sensitive. Future work might need to incorporate retrieval features from the Internet to compensate for the deficiencies of current large models in acquiring the latest or domain-specific knowledge.

5 Conclusion and Future Work

In this paper, we seek to preliminarily investigate the performance of LLMs, exemplified by the GPT series, on tasks such as KG construction and reasoning. While these models excel at such tasks, we pose the question: do LLMs' advantages in extraction tasks stem from their vast knowledge base or their potent contextual learning capability? To explore this, we devise a virtual knowledge extraction task and create a corresponding dataset for experimentation. Results indicate that large models indeed possess robust contextual learning abilities. Furthermore, we propose an innovative method for accomplishing KG construction and reasoning tasks by employing multiple agents. This strategy not only alleviates manual labor but also compensates for the dearth of human expertise across various domains, thereby enhancing the performance of LLMs. Although this approach still has some limitations, it offers a fresh perspective for the advancement of future applications of LLMs.

Limitations

While our research has yielded some results, it also possesses certain limitations. As previously stated, the inability to access the GPT-4 API has necessitated our reliance on an interactive interface for conducting experiments, undeniably inflating workload and time costs. Moreover, as GPT-4's multimodal capabilities are not currently available to the public, we are temporarily unable to delve into its performance and contributions to multimodal processing. We look forward to future research opportunities that will allow us to further explore these areas.

References

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165.

Borui Cai, Yong Xiang, Longxiang Gao, He Zhang, Yunfeng Li, and Jianxin Li. 2022. Temporal knowledge graph completion: a survey. arXiv preprint arXiv:2201.08236.

Xiang Chen, Jintian Zhang, Xiaohan Wang, Tongtong Wu, Shumin Deng, Yongheng Wang, Luo Si, Huajun Chen, and Ningyu Zhang. 2023. Continual multimodal knowledge graph construction. CoRR, abs/2305.08698.
Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In WWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25-29, 2022, pages 2778–2788. ACM.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 167–176. The Association for Computer Linguistics.

Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Trans. Assoc. Comput. Linguistics, 4:357–370.

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4299–4307.

Shumin Deng, Ningyu Zhang, Jiaojian Kang, Yichi Zhang, Wei Zhang, and Huajun Chen. 2020. Meta-learning with dynamic-memory-based prototypical network for few-shot event detection. In WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, pages 151–159. ACM.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A survey for in-context learning. CoRR, abs/2301.00234.

Jun Gao, Huan Zhao, Changlong Yu, and Ruifeng Xu. 2023. Exploring the feasibility of chatgpt for event extraction. CoRR, abs/2303.03836.

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021a. (Comet-)Atomic 2020: On symbolic and neural commonsense knowledge graphs. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 6384–6392. AAAI Press.

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021b. Comet-atomic 2020: On symbolic and neural commonsense knowledge graphs. In AAAI.

Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa Stüber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian O. Sabel, Jens Ricke, and Michael Ingrisch. 2022. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports. CoRR, abs/2212.14882.

Kelvin Jiang, Dekun Wu, and Hui Jiang. 2019. Freebaseqa: A new factoid QA data set matching trivia-style question-answer pairs with freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 318–323. Association for Computational Linguistics.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt a good translator? A preliminary study. CoRR, abs/2301.08745.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6769–6781. Association for Computational Linguistics.

Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, and Dragomir Radev. 2023. Evaluating gpt-4 and chatgpt on japanese medical licensing examinations. arXiv preprint arXiv:2303.18027.

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. CoRR, abs/2304.05613.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large scale language model society. CoRR, abs/2303.17760.

Shuangjie Li, Wei He, Yabing Shi, Wenbin Jiang, Haijin Liang, Ye Jiang, Yang Zhang, Yajuan Lyu, and Yong Zhu. 2019. Duie: A large-scale chinese dataset for information extraction. In Natural Language Processing and Chinese Computing - 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9-14, 2019, Proceedings, Part II, volume 11839 of Lecture Notes in Computer Science, pages 791–800. Springer.

Ke Liang, Lingyuan Meng, Meng Liu, Yue Liu, Wenxuan Tu, Siwei Wang, Sihang Zhou, Xinwang Liu, and Fuchun Sun. 2022. Reasoning over different types of knowledge graphs: Static, temporal and multi-modal. CoRR, abs/2212.05767.

Aiwei Liu, Xuming Hu, Lijie Wen, and Philip S. Yu. 2023. A comprehensive evaluation of chatgpt's zero-shot text-to-sql capability. CoRR, abs/2303.13547.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3219–3232. Association for Computational Linguistics.

Qing Lyu, Josh Tan, Michael E. Zapadka, Janardhana Ponnatapuram, Chuang Niu, Ge Wang, and Christopher T. Whitlow. 2023. Translating radiology reports into plain language using chatgpt and GPT-4 with prompt learning: Promising results, limitations, and potential. CoRR, abs/2303.09038.

Yubo Ma, Yixin Cao, YongChing Hong, and Aixin Sun. 2023. Large language model is not a good few-shot information extractor, but a good reranker for hard samples! CoRR, abs/2303.08559.

Navid Madani and Kenneth Joseph. 2023. Answering questions over knowledge graphs using logic programming along with language models. CoRR, abs/2303.02206.

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1400–1409. The Association for Computational Linguistics.

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.

Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto de Alencar Lotufo, and Rodrigo Frassetto Nogueira. 2023. Evaluating GPT-3.5 and GPT-4 models on brazilian university admission exams. CoRR, abs/2303.17003.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Seongsik Park and Harksoo Kim. 2021. Improving sentence-level relation extraction through curriculum learning. CoRR, abs/2107.09332.

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2022. Reasoning with language model prompting: A survey. CoRR, abs/2212.09597.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? CoRR, abs/2302.06476.

Andrea Rossi, Denilson Barbosa, Donatella Firmani, Antonio Matinata, and Paolo Merialdo. 2021. Knowledge graph embedding for link prediction: A comparative analysis. ACM Trans. Knowl. Discov. Data, 15(2):14:1–14:49.

Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, and Lakshmivihari Mareedu. 2023. An independent evaluation of chatgpt on mathematical word problems (MWP). CoRR, abs/2302.13814.

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Trans. Knowl. Data Eng., 27(2):443–460.

Sifatkaur, Manmeet Singh, Vaisakh S. B, and Neetiraj Malviya. 2023. Mind meets machine: Unravelling gpt-4's cognitive psychology. CoRR, abs/2303.11436.

George Stoica, Emmanouil Antonios Platanios, and Barnabás Póczos. 2021. Re-tacred: Addressing shortcomings of the TACRED dataset. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 13843–13850. AAAI Press.

Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen, and Guilin Qi. 2023. Evaluation of chatgpt as a question answering system for answering complex questions. CoRR, abs/2303.07992.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1499–1509. The Association for Computational Linguistics.

Sijia Wang, Mo Yu, and Lifu Huang. 2022a. The art of prompting: Event detection based on type specific prompts. CoRR, abs/2204.07241.

Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. 2020. MAVEN: A massive general domain event detection dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 1652–1671. Association for Computational Linguistics.

Xingyao Wang, Sha Li, and Heng Ji. 2022b. Code4struct: Code generation for few-shot structured prediction from natural language. CoRR, abs/2210.12810.

Xintao Wang, Qianyu He, Jiaqing Liang, and Yanghua Xiao. 2022c. Language models as knowledge embeddings.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. CoRR, abs/2206.07682.

Jerry W. Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. 2023a. Larger language models do in-context learning differently. CoRR, abs/2303.03846.

Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, Yong Jiang, and Wenjuan Han. 2023b. Zero-shot information extraction

graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 6069–6076. AAAI Press.

Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. CoRR, abs/2101.00774.

Xiangru Zhu, Zhixu Li, Xiaodan Wang, Xueyao Jiang, Penglei Sun, Xuwu Wang, Yanghua Xiao, and Nicholas Jing Yuan. 2022. Multi-modal knowledge graph construction and application: A survey. IEEE
via chatting with chatgpt. CoRR, abs/2302.10205. Transactions on Knowledge and Data Engineering.
给定句子: 史奎英,女,中石油下属单位基层退休干部,原国资委主任、中石油董事长蒋洁敏妻子
SPO三元组:
(Translation: Given sentence: Shi Kuiying, female, a retired grassroots cadre of a PetroChina subsidiary, and the wife of Jiang Jiemin, former director of the SASAC and former chairman of PetroChina. SPO triples:)
Link Prediction
Zero-shot: predict the tail entity [MASK] from the given (40th Academy Awards, time event locations, [MASK]) by completing the sentence "what is the locations of 40th Academy Awards? The answer is".
One-shot: predict the tail entity [MASK] from the given (1992 NCAA Men's Division I Basketball Tournament, time event locations, [MASK]) by completing the sentence "what is the locations of 1992 NCAA Men's Division I Basketball Tournament? The answer is". The answer is Albuquerque, so the [MASK] is Albuquerque.
predict the tail entity [MASK] from the given (40th Academy Awards, time event locations, [MASK]) by completing the sentence "what is the locations of 40th Academy Awards? The answer is".

Question Answering
Zero-shot: Please answer the following question. Note that there may be more than one answer to the question.
Question: [Lamont Johnson] was the director of which films?
Answer:
One-shot: Please answer the following question. Note that there may be more than one answer to the question.
Question: [Aaron Lipstadt] was the director of which movies?
Answer: Android | City Limits
Question: [Lamont Johnson] was the director of which films?
Answer:

Table 4: Examples of the zero-shot and one-shot prompts we used for Event Detection, Link Prediction, and Question Answering.
B Prompts for Virtual Knowledge Extraction
Prompts
There might be Subject-Predicate-Object triples in the following sentence. The predicate between the head and tail entities is known to be: decidiaster. Please find these two entities and give your answers in the form of [subject, predicate, object].

Example:
The given sentence is: The link to the blog was posted on the website of the newspaper Brabants Dagblad, which has identified the crash survivor as Schoolnogo, who had been on safari with his 40-year-old father Patrick, mother Trudy, 41, and brother Reptance, 11.
Triples: [Schoolnogo, decidiaster, Reptance]

The given sentence is: Intranguish's brother, Nugculous, told reporters in Italy that "there were moments that I believed he would never come back," ANSA reported.
Triples: [Intranguish, decidiaster, Nugculous]
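Since the virtual knowledge extraction prompt above is also templated, it can be generated and its answer parsed mechanically. The sketch below uses our own naming (`build_vke_prompt`, `parse_triples`) and an illustrative regex for reading `[subject, predicate, object]` answers; it is an assumption about how such a pipeline could look, not the paper's implementation:

```python
import re

TEMPLATE = (
    "There might be Subject-Predicate-Object triples in the following "
    "sentence. The predicate between the head and tail entities is "
    "known to be: {predicate}.\n"
    "Please find these two entities and give your answers in the form "
    "of [subject, predicate, object].\n"
)

def build_vke_prompt(predicate, examples, target_sentence):
    """Assemble the prompt: instruction, in-context demonstrations,
    then the query sentence awaiting its triples."""
    parts = [TEMPLATE.format(predicate=predicate), "Example:"]
    for sentence, triple in examples:
        parts.append(f"The given sentence is: {sentence}")
        parts.append(f"Triples: [{', '.join(triple)}]")
    parts.append(f"The given sentence is: {target_sentence}")
    parts.append("Triples:")
    return "\n".join(parts)

def parse_triples(model_output):
    """Pull every [subject, predicate, object] triple out of a reply."""
    return [tuple(field.strip() for field in match.split(","))
            for match in re.findall(r"\[([^\]]+)\]", model_output)]
```

For example, `parse_triples("Triples: [Intranguish, decidiaster, Nugculous]")` recovers the tuple `("Intranguish", "decidiaster", "Nugculous")` from the model's answer.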