LLMs For Knowledge Graph Construction and Reasoning

Abstract

Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning. We employ eight distinct datasets that encompass aspects including entity, relation and event extraction, link prediction, and question answering. Empirically, our findings suggest that GPT-4 outperforms ChatGPT in the majority of tasks and even surpasses fine-tuned models in certain reasoning and question-answering datasets. Moreover, our investigation extends to the potential generalization ability of LLMs for information extraction, which culminates in the presentation of the Virtual Knowledge Extraction task and the development of the VINE dataset. Drawing on these empirical findings, we further propose AutoKG, a multi-agent-based approach employing LLMs for KG construction and reasoning, which aims to chart the future of this field and offer exciting opportunities for advancement. We anticipate that our research can provide invaluable insights for future undertakings of KG¹.

1 Introduction

Knowledge Graph (KG) is a semantic network comprising entities, concepts, and relations (Cai et al., 2022; Chen et al., 2023; Zhu et al., 2022; Liang et al., 2022), which can catalyse applications across various scenarios such as recommendation systems, search engines, and question-answering systems (Zhang et al., 2021). Typically, KG construction (Ye et al., 2022b) consists of several tasks, including Named Entity Recognition (NER) (Chiu and Nichols, 2016), Relation Extraction (RE) (Zeng et al., 2015; Chen et al., 2022), Event Extraction (EE) (Chen et al., 2015; Deng et al., 2020), and Entity Linking (EL) (Shen et al., 2015). On the other hand, KG reasoning centers on tasks such as link prediction (Rossi et al., 2021). Additionally, KGs can be employed in Question Answering (QA) tasks (Karpukhin et al., 2020; Zhu et al., 2021), by reasoning over relational sub-graphs in relation to the questions.

Earlier, the construction and reasoning of KGs typically relied on supervised learning methodologies. However, in recent years, with the notable advancements of Large Language Models (LLMs), researchers have observed their exceptional ability within the field of natural language processing (NLP). Despite numerous studies on LLMs (Liu et al., 2023; Shakarian et al., 2023; Lai et al., 2023), a systematic exploration of their application in the KG domain is still limited. To address this gap, our work investigates the potential applicability of LLMs, exemplified by ChatGPT and GPT-4 (OpenAI, 2023), in KG construction and reasoning tasks. By comprehending the fundamental capabilities of LLMs, our study further delves into potential future directions in the field.

Specifically, as illustrated in Figure 1, we initially investigate the zero-shot and one-shot performance of LLMs on entity, relation, and event extraction, link prediction, and question answering to assess their potential applications within the KG domain. The empirical findings reveal that, although LLMs exhibit improved performance in KG construction tasks, they still lag behind state-of-the-art (SOTA) models. However, LLMs demonstrate relatively superior performance in reasoning and question answering tasks. This suggests that they are adept at addressing complex problems, comprehending contextual relations, and leveraging the knowledge acquired during pre-training. Consequently, LLMs such as GPT-4 exhibit limited effectiveness as few-shot information extractors, yet demonstrate considerable proficiency as inference assistants. To further investigate the behavior

∗ Equal contribution and shared co-first authorship.
† Corresponding author.
¹ Code and datasets will be available in https://github.com/zjunlp/AutoKG.
[Figure 1 here. Panel 1: Recent Capabilities of GPT-4 for KG Construction and Reasoning — Relation Extraction (RE), Event Detection (ED), Link Prediction (LP), and Question Answering (QA). Panel 2: Discussion — do LLMs have memorized knowledge, or do they truly have generalization ability? Virtual Knowledge Extraction with the VINE dataset probes unseen knowledge.]
Figure 1: The overview of our work. There are three main components: 1) Basic Evaluation: detailing our assessment of large models (text-davinci-003, ChatGPT, and GPT-4) in both zero-shot and one-shot settings, using performance data from fully supervised state-of-the-art models as benchmarks; 2) Virtual Knowledge Extraction: an examination of large models' virtual knowledge capabilities on the constructed VINE dataset; and 3) AutoKG: the proposal of utilizing multiple agents to facilitate the construction and reasoning of KGs.
of LLMs on information extraction tasks, we design a novel task known as "Virtual Knowledge Extraction". This undertaking aims to discern whether the observed improvements in performance stem from the extensive knowledge repository intrinsic to LLMs, or from their potent generalization capabilities facilitated by instruction tuning and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017). Experimental results on a newly constructed dataset, VINE, indicate that LLMs like GPT-4 can rapidly acquire new knowledge from instructions and accomplish relevant extraction tasks effectively.

Based on these empirical findings, we argue that the considerable reliance of LLMs on instructions makes the design of appropriate prompts a laborious and time-consuming endeavor for KG construction and reasoning. To facilitate further research, we introduce the concept of AutoKG, which employs multiple LLM agents for KG construction and reasoning automatically.

In summary, our research has made the following contributions:

• We evaluate LLMs, including GPT-3.5, ChatGPT, and GPT-4, offering an initial understanding of their capabilities by evaluating their zero-shot and one-shot performance on KG construction and reasoning on eight benchmark datasets.

• We design a novel Virtual Knowledge Extraction task and construct the VINE dataset. By evaluating the performance of LLMs on this dataset, we further demonstrate that LLMs such as GPT-4 possess strong generalization abilities.

• We introduce the concept of employing communicative agents for automatic KG construction and reasoning, known as AutoKG. Leveraging the LLMs' knowledge repositories, we enable multiple LLM agents to assist in the process of KG construction and reasoning through iterative dialogues, providing new insights for future research.
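The zero-shot and one-shot settings evaluated above differ only in whether a single worked demonstration precedes the task instruction. As a rough, hypothetical sketch of how such prompts can be assembled (the instruction wording, example, and helper name are ours, not the paper's exact prompts):

```python
def build_prompt(instruction, text, demonstration=None):
    """Assemble a zero-shot or one-shot prompt for an extraction task."""
    parts = [instruction]
    if demonstration is not None:
        # One-shot: a single worked example precedes the actual input.
        parts.append("Example:\n" + demonstration)
    parts.append("Input: " + text + "\nOutput:")
    return "\n\n".join(parts)

instruction = "Extract all (head entity, relation, tail entity) triples from the input."
demo = ("Input: We have implemented a restricted domain parser called Plume.\n"
        "Output: [Plume, HYPONYM-OF, restricted domain parser]")

zero_shot = build_prompt(instruction, "Some sentence.")
one_shot = build_prompt(instruction, "Some sentence.", demonstration=demo)
```

In the one-shot setting the demonstration would typically be a gold input/output pair drawn from the training split of the same dataset.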
2 Related Work

2.1 Large Language Models

LLMs are pre-trained on substantial amounts of textual data and have become a significant component of contemporary NLP research. Recent advancements in NLP have led to the development of highly capable LLMs, such as GPT-3 (Brown et al., 2020), ChatGPT, and GPT-4, which exhibit exceptional performance across a diverse array of NLP tasks, including machine translation, text summarization, and question answering. Concurrently, several previous studies have indicated that LLMs can achieve remarkable results in relevant downstream tasks with minimal or even no demonstration in the prompt (Wei et al., 2022; Bang et al., 2023; Nori et al., 2023; Qiao et al., 2022; Wei et al., 2023b). This provides further evidence of the robustness and generality of LLMs.

2.2 ChatGPT & GPT-4

ChatGPT, an advanced LLM developed by OpenAI, is primarily designed for engaging in human-like conversations. During the fine-tuning process, ChatGPT utilizes RLHF (Christiano et al., 2017), thereby enhancing its alignment with human preferences and values.

As a cutting-edge large language model developed by OpenAI, GPT-4 builds upon the successes of its predecessors, GPT-3 and ChatGPT. Trained on an unparalleled scale of computation and data, it exhibits remarkable generalization, inference, and problem-solving capabilities across diverse domains. In addition, as a large-scale multimodal model, GPT-4 is capable of processing both image and text inputs. In general, the public release of GPT-4 offers fresh insights into the future advancement of LLMs and presents novel opportunities and challenges within the realm of NLP.

With the popularity of LLMs, an increasing number of researchers are exploring the specific emergent capabilities and advantages they possess. Bang et al. (2023) performs an in-depth analysis of ChatGPT on multitask, multilingual, and multimodal aspects. The findings indicate that ChatGPT excels at zero-shot learning across various tasks, even outperforming fine-tuned models in certain cases. However, it faces challenges when generalized to low-resource languages. Furthermore, in terms of multi-modality, compared to more advanced vision-language models, the capabilities of ChatGPT remain fundamental. Moreover, ChatGPT has garnered considerable attention in various other domains, including information extraction (Gao et al., 2023; Wei et al., 2023b), reasoning (Qin et al., 2023), text summarization (Jeblick et al., 2022), question answering (Tan et al., 2023) and machine translation (Jiao et al., 2023), showcasing its versatility and applicability in the broader field of natural language processing.

While there is a growing body of research on ChatGPT, investigations into GPT-4 continue to be relatively limited. Nori et al. (2023) conducts an extensive assessment of GPT-4 on medical competency examinations and benchmark datasets and shows that GPT-4, without any specialized prompt crafting, surpasses the passing score by over 20 points. Kasai et al. (2023) also studies GPT-4's performance on the Japanese national medical licensing examinations. Furthermore, there are some studies on GPT-4 that focus on cognitive psychology (Sifatkaur et al., 2023), academic exams (Nunes et al., 2023), and translation of radiology reports (Lyu et al., 2023).

3 Recent Capabilities of LLMs for KG Construction and Reasoning

Recently, the advent of LLMs has invigorated the field of NLP. With an aim to explore the potential applications of LLMs in the domain of KG, we select representative models, namely ChatGPT and GPT-4. We conduct a comprehensive assessment of their performance across eight disparate datasets in the field of KG construction and reasoning.

3.1 Evaluation Principle

In this study, we conduct a systematic evaluation of LLMs across various KG-related tasks. Firstly, we assess these models' capabilities in zero-shot and one-shot NLP tasks. Our primary objective is to examine their generalization abilities when faced with limited data, as well as their capacity to reason effectively using pre-trained knowledge in the absence of demonstrations. Secondly, drawing on the evaluation results, we present a comprehensive analysis of the factors that contribute to the models' varying performance in distinct tasks. We aim to explore the reasons and potential shortcomings underlying their superior performance in certain tasks. By comparing and summarizing the strengths and limitations of these models, we seek to offer insights that may guide future improvements.
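The construction-task scores reported later (Table 1) are F1 values over extracted structures. A minimal illustration of how triple-level micro-F1 can be computed under exact matching — our own sketch, since the paper does not spell out its matching procedure:

```python
def triple_f1(predicted, gold):
    """Micro F1 over (head, relation, tail) triples under exact matching."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)  # exact-match true positives
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("Plume", "HYPONYM-OF", "restricted domain parser")]
pred = [("Plume", "HYPONYM-OF", "restricted domain parser"),
        ("We", "USED-FOR", "Plume")]
print(round(triple_f1(pred, gold), 3))  # precision 0.5, recall 1.0 -> 0.667
```

Exact matching is a strict choice; looser variants (e.g. normalizing entity mentions before comparison) would change the scores.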
[Figure 2 panels:]

(1) SciERC — input: "We have implemented a restricted domain parser called Plume." Extracted triples: [Plume, HYPONYM-OF, restricted domain parser]; [We, USED-FOR, implementing Plume].

(2) RE-TACRED — input: "The two projects -- a trachoma prevention plan and a cooking oil plan -- are jointly organized by the New York-based Helen Keller International (HKI), the United Nations Children's Fund and the World Health Organization, the spokesman said, adding that the HKI will implement the two programs using funds donated by Taiwan." Extracted triples: [two projects, org:organized_by, Helen Keller International, UNICEF, World Health Organization]; [two projects, org:implemented_by, Helen Keller International]; [two projects, org:funded_by, Taiwan].

(3) DuIE2.0 (Chinese input, translated) — input: "National team career: George Wilcombe was selected for the Honduras national team in 2008, and he participated with the team in the 2009 North and Central America and Caribbean Gold Cup." Extracted triples (translated): [George Wilcombe, National Team Career, Selected for Honduras National Team]; [George Wilcombe, Participated, 2009 North and Central America and Caribbean Gold Cup]; [George Wilcombe, Participated, Honduras national team]; [2008, Selected, George Wilcombe]; [2009 North and Central America and the Caribbean Gold Cup, Participated, George Wilcombe with the team].

Figure 2: Examples of ChatGPT and GPT-4 on the RE datasets. (1) Zero-shot on the SciERC dataset; (2) Zero-shot on the RE-TACRED dataset; (3) One-shot on the DuIE2.0 dataset.
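The bracketed outputs shown in Figure 2 must be parsed back into structured triples before scoring. A simplified, hypothetical parser (it keeps only well-formed three-element triples, so outputs with extra commas, such as the multi-object RE-TACRED triple above, would need special handling):

```python
import re

def parse_triples(response):
    """Parse lines like "[head, relation, tail]" from a model response."""
    triples = []
    for match in re.finditer(r"\[([^\[\]]+)\]", response):
        parts = [p.strip() for p in match.group(1).split(",")]
        if len(parts) == 3:  # keep only well-formed (head, relation, tail) triples
            triples.append(tuple(parts))
    return triples

response = """[Plume, HYPONYM-OF, restricted domain parser]
[We, USED-FOR, implementing Plume]"""
print(parse_triples(response))
```

A more robust pipeline might instead ask the model for JSON output, which removes the ambiguity of commas inside entity names.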
head-to-tail entity extraction compared to ChatGPT. In the RE-TACRED example, the target triple is (Helen Keller International, org:alternate_names, HKI). ChatGPT, however, fails to extract such relations, possibly due to the close proximity of the head and tail entities and the ambiguity of predicates in this instance. In contrast, GPT-4 successfully extracts the "org:alternate_names" relationship between the head and tail entities, completing the triple extraction. This also demonstrates, to a certain extent, GPT-4's enhanced language comprehension (reading) capabilities compared to ChatGPT.

• One-shot Simultaneously, the optimization of text instructions can contribute to the enhancement of GPT-4's performance. The results indicate that introducing a single training sample in a one-shot manner further boosts the triple extraction abilities of both GPT-4 and ChatGPT.

Using the DuIE2.0 dataset as an example, consider the sentence: "George Wilcombe was selected for the Honduras national team in 2008, and he participated in the 2009 North and Central America and Caribbean Gold Cup with the team". The corresponding triple should be (George Wilcombe, Nationality, Honduras). Although this information is not explicitly stated in the text, GPT-4 successfully extracts it. This outcome is not solely attributed to the single training example providing valuable information; it also stems from GPT-4's comprehensive knowledge base. In this case, GPT-4 infers George Wilcombe's nationality from his selection for the national team.

Nevertheless, during the experiments GPT-4 does not perform flawlessly and struggles with some complex sentences. It is important to note that factors such as dataset noise and ambiguity in the types, including intricate sentence contexts, contribute to this outcome. Moreover, to enable a fair and cross-sectional comparison of different models on the datasets, the experiments do not specify the types of extracted entities in the prompt, which could also influence the experimental results. This observation may help elucidate why GPT-4's performance on SciERC and Re-TACRED is inferior to its performance on DuIE2.0.

Fine-Tuned SOTA      69.42   91.4   68.8   53.2
Zero-shot
  text-davinci-003   11.43    9.8   30.0    4.0
  ChatGPT            10.26   15.2   26.5    4.4
  GPT-4              31.03   15.5   34.2    7.2
One-shot
  text-davinci-003   30.63   12.8   25.0    4.8
  ChatGPT            25.86   14.2   34.1    5.3
  GPT-4              41.91   22.5   30.4    9.1

Table 1: KG Construction tasks (F1 score).

Event Extraction We conduct experiments on event detection with 20 random samples from the MAVEN dataset. Moreover, Wang et al. (2022a) is used as the previous fine-tuned SOTA. Concurrently, GPT-4 achieves commendable results even without the provision of demonstrations. Here we use the F1 score as the evaluation metric.

• Zero-shot Table 1 shows that GPT-4 outperforms ChatGPT in the zero-shot manner. Furthermore, on the example sentence beginning "Now an …", ChatGPT outputs only the single result Becoming_a_member, while GPT-4 identifies three event types: Becoming_a_member, Agree_or_refuse_to_act, and Performing. It is worth noting that in this experiment, ChatGPT frequently provides answers with only one event type. In contrast, GPT-4 is more adept at acquiring contextual information.

• One-shot Under the one-shot setting, GPT-4 experienced a slight decline in performance. The inclusion of a single demonstration rectifies ChatGPT's erroneous judgments made under the zero-shot setting, thereby enhancing its performance.

[Figure 6 panels:]

1) Link Prediction — zero-shot prompt: "Predict the tail entity [MASK] from the given (Academy Award for Best Film Editing, award award category category of, [MASK]) by completing the sentence 'what is the category of of Academy Award for Best Film Editing? The answer is'." Output: "The answer is Academy Awards, so the [MASK] is Academy Awards." A second prompt of the same form is built from (Primetime Emmy Award for Outstanding Guest Actress - Comedy Series, award award category category of, [MASK]).

2) Question Answering — "When did the films written by [A Gathering of Old Men] writers release?" Answer: 1983|1987|1984|1993|2009|2010|1995. Answer (GPT-4 only): 1999|1974|1987|1995|1984.

Figure 6: Illustration of AutoKG, which integrates knowledge graph (KG) construction and reasoning by employing GPT-4 and communicative agents based on ChatGPT. The figure omits the specific operational process of the AI assistant and AI user, providing the results directly.
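The AI-assistant/AI-user dialogue summarized in Figure 6 can be caricatured as a turn-taking loop between two role-prompted agents, in the spirit of CAMEL (Li et al., 2023). The sketch below is our own illustration with stubbed replies in place of real LLM calls; `make_agent` and the role prompts are hypothetical, not the paper's implementation:

```python
def make_agent(system_prompt, reply_fn):
    """A minimal role-prompted agent that keeps its own message history."""
    history = [("system", system_prompt)]

    def step(incoming):
        history.append(("user", incoming))
        reply = reply_fn(incoming)  # stand-in for a real LLM chat call
        history.append(("assistant", reply))
        return reply

    return step

# Stubbed replies keep the loop runnable without any API access.
user_agent = make_agent(
    "You are the AI user. Give the assistant one KG-construction instruction per turn.",
    lambda _: "Instruction: extract triples from the next passage.")
assistant_agent = make_agent(
    "You are the AI assistant. Carry out each instruction and report the result.",
    lambda msg: f"Done: {msg}")

message = "Task: build a knowledge graph about film awards."
for _ in range(2):  # iterative dialogue with a fixed number of rounds
    instruction = user_agent(message)
    message = assistant_agent(instruction)
print(message)
```

In a real deployment each `reply_fn` would call a chat model with the agent's system prompt plus its accumulated history, and the loop would terminate on a task-completion signal rather than a fixed round count.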
the domain knowledge of human experts to yield high-quality knowledge graphs. Simultaneously, it could improve the factual accuracy of large language models when dealing with domain-specific tasks, courtesy of this human-machine collaboration. This objective, in turn, is expected to increase the models' practical utility.

AutoKG not only expedites the customization of domain-specific knowledge graphs but also augments the transparency of large-scale models and the interaction of embodied agents. More precisely, AutoKG assists in a deeper comprehension of the internal knowledge structures and operating mechanisms of LLMs, thereby enhancing model transparency. Moreover, AutoKG can be deployed as a cooperative human-machine interaction platform, enabling effective communication and interaction between humans and the model. This interaction promotes a superior understanding and guidance of the model's learning and decision-making processes, thereby enhancing the model's efficiency and accuracy when dealing with intricate tasks.

Despite the notable advancements brought about by our approach, it is not without its limitations. These limitations, however, present themselves as opportunities for further exploration and improvement:

The utilization of the API is constrained by a maximum token limit. Currently, due to the unavailability of the GPT-4 API, the gpt-3.5-turbo in use is subjected to a max token restriction. This constraint impacts the construction of KGs, as tasks may not be executed appropriately if this limit is exceeded.

AutoKG currently exhibits shortcomings in facilitating efficient human-machine interaction. In situations where tasks are wholly autonomously conducted by machines, humans cannot promptly rectify errors that occur during communication. Conversely, involving humans in each step of machine communication can significantly augment time and labor costs. Thus, determining the optimal moment for human intervention is crucial for the efficient and effective construction of knowledge graphs.

Training data for LLMs are time-sensitive. Future work might need to incorporate retrieval features from the Internet to compensate for the deficiencies of current large models in acquiring the latest or domain-specific knowledge.

5 Conclusion and Future Work

In this paper, we seek to preliminarily investigate the performance of LLMs, exemplified by the GPT series, on tasks such as KG construction and reasoning. While these models excel at such tasks, we pose the question: do LLMs' advantages in extraction tasks stem from their vast knowledge base or their potent contextual learning capability? To explore this, we devise a virtual knowledge extraction task and create a corresponding dataset for experimentation. Results indicate that large models indeed possess robust contextual learning abilities. Furthermore, we propose an innovative method for accomplishing KG construction and reasoning tasks by employing multiple agents. This strategy not only alleviates manual labor but also compensates for the dearth of human expertise across various domains, thereby enhancing the performance of LLMs. Although this approach still has some limitations, it offers a fresh perspective for the advancement of future applications of LLMs.

Limitations

While our research has yielded some results, it also possesses certain limitations. As previously stated, the inability to access the GPT-4 API has necessitated our reliance on an interactive interface for conducting experiments, undeniably inflating workload and time costs. Moreover, as GPT-4's multimodal capabilities are not currently available to the public, we are temporarily unable to delve into its performance and contributions to multimodal processing. We look forward to future research opportunities that will allow us to further explore these areas.

References

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165.

Borui Cai, Yong Xiang, Longxiang Gao, He Zhang, Yunfeng Li, and Jianxin Li. 2022. Temporal knowledge graph completion: a survey. arXiv preprint arXiv:2201.08236.

Xiang Chen, Jintian Zhang, Xiaohan Wang, Tongtong Wu, Shumin Deng, Yongheng Wang, Luo Si, Huajun Chen, and Ningyu Zhang. 2023. Continual multimodal knowledge graph construction. CoRR, abs/2305.08698.
Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In WWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25-29, 2022, pages 2778–2788. ACM.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 167–176. The Association for Computer Linguistics.

Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Trans. Assoc. Comput. Linguistics, 4:357–370.

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4299–4307.

Shumin Deng, Ningyu Zhang, Jiaojian Kang, Yichi Zhang, Wei Zhang, and Huajun Chen. 2020. Meta-learning with dynamic-memory-based prototypical network for few-shot event detection. In WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, pages 151–159. ACM.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A survey for in-context learning. CoRR, abs/2301.00234.

Jun Gao, Huan Zhao, Changlong Yu, and Ruifeng Xu. 2023. Exploring the feasibility of chatgpt for event extraction. CoRR, abs/2303.03836.

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021a. (Comet-)Atomic 2020: On symbolic and neural commonsense knowledge graphs. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 6384–6392. AAAI Press.

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021b. Comet-atomic 2020: On symbolic and neural commonsense knowledge graphs. In AAAI.

Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa Stüber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian O. Sabel, Jens Ricke, and Michael Ingrisch. 2022. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports. CoRR, abs/2212.14882.

Kelvin Jiang, Dekun Wu, and Hui Jiang. 2019. Freebaseqa: A new factoid QA data set matching trivia-style question-answer pairs with freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 318–323. Association for Computational Linguistics.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt a good translator? A preliminary study. CoRR, abs/2301.08745.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6769–6781. Association for Computational Linguistics.

Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, and Dragomir Radev. 2023. Evaluating gpt-4 and chatgpt on japanese medical licensing examinations. arXiv preprint arXiv:2303.18027.

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. CoRR, abs/2304.05613.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large scale language model society. CoRR, abs/2303.17760.

Shuangjie Li, Wei He, Yabing Shi, Wenbin Jiang, Haijin Liang, Ye Jiang, Yang Zhang, Yajuan Lyu, and Yong Zhu. 2019. Duie: A large-scale chinese dataset for information extraction. In Natural Language Processing and Chinese Computing - 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9-14, 2019, Proceedings, Part II, volume 11839 of Lecture Notes in Computer Science, pages 791–800. Springer.

Ke Liang, Lingyuan Meng, Meng Liu, Yue Liu, Wenxuan Tu, Siwei Wang, Sihang Zhou, Xinwang Liu, and Fuchun Sun. 2022. Reasoning over different types of knowledge graphs: Static, temporal and multi-modal. CoRR, abs/2212.05767.

Aiwei Liu, Xuming Hu, Lijie Wen, and Philip S. Yu. 2023. A comprehensive evaluation of chatgpt's zero-shot text-to-sql capability. CoRR, abs/2303.13547.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3219–3232. Association for Computational Linguistics.

Qing Lyu, Josh Tan, Michael E. Zapadka, Janardhana Ponnatapuram, Chuang Niu, Ge Wang, and Christopher T. Whitlow. 2023. Translating radiology reports into plain language using chatgpt and GPT-4 with prompt learning: Promising results, limitations, and potential. CoRR, abs/2303.09038.

Yubo Ma, Yixin Cao, YongChing Hong, and Aixin Sun. 2023. Large language model is not a good few-shot information extractor, but a good reranker for hard samples! CoRR, abs/2303.08559.

Navid Madani and Kenneth Joseph. 2023. Answering questions over knowledge graphs using logic programming along with language models. CoRR, abs/2303.02206.

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1400–1409. The Association for Computational Linguistics.

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.

Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto de Alencar Lotufo, and Rodrigo Frassetto Nogueira. 2023. Evaluating GPT-3.5 and GPT-4 models on brazilian university admission exams. CoRR, abs/2303.17003.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Seongsik Park and Harksoo Kim. 2021. Improving sentence-level relation extraction through curriculum learning. CoRR, abs/2107.09332.

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2022. Reasoning with language model prompting: A survey. CoRR, abs/2212.09597.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? CoRR, abs/2302.06476.

Andrea Rossi, Denilson Barbosa, Donatella Firmani, Antonio Matinata, and Paolo Merialdo. 2021. Knowledge graph embedding for link prediction: A comparative analysis. ACM Trans. Knowl. Discov. Data, 15(2):14:1–14:49.

Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, and Lakshmivihari Mareedu. 2023. An independent evaluation of chatgpt on mathematical word problems (MWP). CoRR, abs/2302.13814.

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Trans. Knowl. Data Eng., 27(2):443–460.

Sifatkaur, Manmeet Singh, Vaisakh S. B, and Neetiraj Malviya. 2023. Mind meets machine: Unravelling gpt-4's cognitive psychology. CoRR, abs/2303.11436.

George Stoica, Emmanouil Antonios Platanios, and Barnabás Póczos. 2021. Re-tacred: Addressing shortcomings of the TACRED dataset. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 13843–13850. AAAI Press.

Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen, and Guilin Qi. 2023. Evaluation of chatgpt as a question answering system for answering complex questions. CoRR, abs/2303.07992.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1499–1509. The Association for Computational Linguistics.

Sijia Wang, Mo Yu, and Lifu Huang. 2022a. The art of prompting: Event detection based on type specific prompts. CoRR, abs/2204.07241.

Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. 2020. MAVEN: A massive general domain event detection dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 1652–1671. Association for Computational Linguistics.

Xingyao Wang, Sha Li, and Heng Ji. 2022b. Code4struct: Code generation for few-shot structured prediction from natural language. CoRR, abs/2210.12810.

Xintao Wang, Qianyu He, Jiaqing Liang, and Yanghua Xiao. 2022c. Language models as knowledge embeddings.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. CoRR, abs/2206.07682.

Jerry W. Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. 2023a. Larger language models do in-context learning differently. CoRR, abs/2303.03846.

Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, Yong Jiang, and Wenjuan Han. 2023b. Zero-shot information extraction

graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 6069–6076. AAAI Press.

Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. CoRR, abs/2101.00774.

Xiangru Zhu, Zhixu Li, Xiaodan Wang, Xueyao Jiang, Penglei Sun, Xuwu Wang, Yanghua Xiao, and Nicholas Jing Yuan. 2022. Multi-modal knowledge graph construction and application: A survey. IEEE
via chatting with chatgpt. CoRR, abs/2302.10205. Transactions on Knowledge and Data Engineering.
给定句子: 史奎英,女,中石油下属单位基层退休干部,原国资委主任、中石油董事长蒋洁敏妻子
SPO三元组:
(Translation: Given sentence: Shi Kuiying, female, a retired grassroots cadre of a PetroChina subsidiary, and the wife of Jiang Jiemin, former director of the SASAC and former chairman of PetroChina. SPO triples:)
Link Prediction
Zero-shot: predict the tail entity [MASK] from the given (40th Academy Awards, time event locations, [MASK]) by completing the sentence "what is the locations of 40th Academy Awards? The answer is".
One-shot: predict the tail entity [MASK] from the given (1992 NCAA Men's Division I Basketball Tournament, time event locations, [MASK]) by completing the sentence "what is the locations of 1992 NCAA Men's Division I Basketball Tournament? The answer is". The answer is Albuquerque, so the [MASK] is Albuquerque.
predict the tail entity [MASK] from the given (40th Academy Awards, time event locations, [MASK]) by completing the sentence "what is the locations of 40th Academy Awards? The answer is".

Question Answering
Zero-shot: Please answer the following question. Note that there may be more than one answer to the question.
Question: [Lamont Johnson] was the director of which films?
Answer:
One-shot: Please answer the following question. Note that there may be more than one answer to the question.
Question: [Aaron Lipstadt] was the director of which movies?
Answer: Android | City Limits
Question: [Lamont Johnson] was the director of which films?
Answer:

Table 4: Examples of the zero-shot and one-shot prompts we used for Event Detection, Link Prediction, and Question Answering.
B Prompts for Virtual Knowledge Extraction
Prompts
There might be Subject-Predicate-Object triples in the following sentence. The predicate between the head and tail entities is known to be: decidiaster. Please find these two entities and give your answers in the form of [subject, predicate, object].

Example:
The given sentence is: The link to the blog was posted on the website of the newspaper Brabants Dagblad, which has identified the crash survivor as Schoolnogo, who had been on safari with his 40-year-old father Patrick, mother Trudy, 41, and brother Reptance, 11.
Triples: [Schoolnogo, decidiaster, Reptance]

The given sentence is: Intranguish's brother, Nugculous, told reporters in Italy that "there were moments that I believed he would never come back," ANSA reported.
Triples: [Intranguish, decidiaster, Nugculous]
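Since the virtual knowledge extraction prompt above is also templated, it can be generated and its answer parsed mechanically. The sketch below uses our own naming (`build_vke_prompt`, `parse_triples`) and an illustrative regex for reading `[subject, predicate, object]` answers; it is an assumption about how such a pipeline could look, not the paper's implementation:

```python
import re

TEMPLATE = (
    "There might be Subject-Predicate-Object triples in the following "
    "sentence. The predicate between the head and tail entities is "
    "known to be: {predicate}.\n"
    "Please find these two entities and give your answers in the form "
    "of [subject, predicate, object].\n"
)

def build_vke_prompt(predicate, examples, target_sentence):
    """Assemble the prompt: instruction, in-context demonstrations,
    then the query sentence awaiting its triples."""
    parts = [TEMPLATE.format(predicate=predicate), "Example:"]
    for sentence, triple in examples:
        parts.append(f"The given sentence is: {sentence}")
        parts.append(f"Triples: [{', '.join(triple)}]")
    parts.append(f"The given sentence is: {target_sentence}")
    parts.append("Triples:")
    return "\n".join(parts)

def parse_triples(model_output):
    """Pull every [subject, predicate, object] triple out of a reply."""
    return [tuple(field.strip() for field in match.split(","))
            for match in re.findall(r"\[([^\]]+)\]", model_output)]
```

For example, `parse_triples("Triples: [Intranguish, decidiaster, Nugculous]")` recovers the tuple `("Intranguish", "decidiaster", "Nugculous")` from the model's answer.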