
2024 International Conference on Computing, Power, and Communication Technologies (IC2PCT)

Evaluating Top-k RAG-based Approach for Game Review Generation

Pratyush Chauhan, Rahul Kumar Sahani, Soham Datta, Ali Qadir, Manish Raj, Mohd Mohsin Ali
Dept. of Computer Science, Bennett University, Greater Noida, India
pratyushchauhan62@gmail.com, rahul.sahan810@gmail.com, sohamdatta34@gmail.com,
ali.qadir.007@outlook.com, manish.raj@bennett.edu.in, mohdmohsin066@gmail.com

Abstract—Having access to public opinion on a particular product can be a cumbersome task. There are multiple reviews for the same product, some good and some bad depending on the bias of the reviewer. Using LLMs to interpret this data makes it easier to understand the overall perception of a product. This is a study of how well RAG+LLMs can be used as game review generators, built using the state-of-the-art open-source LLM LLaMA 2 13B and the Retrieval Augmented Generation framework LlamaIndex. The goal is to generate and evaluate game reviews that take elements from a set of reviews of a particular game without using any form of fine-tuning. This is achieved by using a rudimentary "query engine" over a subset of publicly available game reviews. Game reviews are converted to vector stores, which allow us to use top-k semantic retrieval for inference. This technique of providing data to an LLM from a document is called Retrieval Augmented Generation, or RAG. Upon experimenting, game-specific reviews were generated (without using any form of fine-tuning) with the help of RAG combined with a top-k semantic retrieval ranking system. The application of this technique goes beyond simple game review generation: it can be used to generate and query context-specific information on any product given enough base information.

Keywords—video game review generation, large language models (LLM), LlamaIndex, naïve retrieval augmented generation (RAG), top-k semantic retrieval.

I. INTRODUCTION

Game reviews serve as an important source of information for prospective players, aiding their decision about whether or not to buy a particular game. These critic reviews offer valuable insights into the strengths and weaknesses of a game. LLMs [1] offer us a chance to make inferences from these vast amounts of unstructured data, as they are a form of AI trained on vast datasets of text, enabling them to grasp human language patterns and produce content that closely resembles human-written content. In the context of generating game reviews, LLMs prove beneficial in crafting reviews that consider the opinions expressed in the underlying human-written game reviews.

A. Conventional approach

Refinement through fine-tuning represents a conventional method for adapting a model to a domain-specific task. This process usually entails modifying the weights of a pre-trained model to enhance its performance on a target task in the desired domain; for example, one might consider fine-tuning an LLM for generating game reviews.

The fine-tuning process is intensive in time and resources. It necessitates a substantial volume of training data, a requirement that poses a challenge in domains where finding quality data is difficult. In the case of game reviews, the limited number of reviews per game can impede the acquisition of a sufficiently large dataset for training.

II. PROPOSED APPROACH

In this study, an innovative approach to game review generation is presented that eliminates the need for fine-tuning for this specific task. This approach involves naive, or top-k, Retrieval Augmented Generation (RAG) [3], combining retrieval with context-specific generation. The retrieval component extracts the top-k information chunks most similar to the query provided, and the generation component produces text based on the retrieved information.

This approach offers an efficiency advantage compared to fine-tuning: since no training is involved, resources are conserved, and the method adapts efficiently to novel domains. A minimal sketch of this retrieve-then-generate idea is given after the list of contributions below.

A. Key contributions

• Introduction of a novel method for generating game reviews without reliance on fine-tuning.

• Demonstration of the capability of our method to produce high-quality, context-specific game reviews.

• Illustration of the efficiency and cost-effectiveness of our method in contrast to fine-tuning. Validation of the generalizability of our method to diverse domains.
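The following is a minimal, illustrative sketch of the naive top-k retrieval-then-generate idea described above, written in Python. The embedding model (sentence-transformers' all-MiniLM-L6-v2), the review snippets, the value of k, and the prompt wording are assumptions for demonstration only; the system described later in the paper uses LlamaIndex with LLaMA 2 13B.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Hypothetical review chunks for a single game (placeholders, not data from the study).
chunks = [
    "The combat feels weighty and rewarding, though bosses spike in difficulty.",
    "Performance on older GPUs is rough, with frequent frame drops in open areas.",
    "The soundtrack and art direction carry the exploration even when quests repeat.",
]
query = "Write a short review covering combat and performance."

# Retrieval component: rank chunks by cosine similarity to the query and keep the top k.
chunk_emb = model.encode(chunks, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, chunk_emb, top_k=2)[0]
retrieved = [chunks[h["corpus_id"]] for h in hits]

# Generation component: the retrieved chunks become the context of the LLM prompt.
prompt = "Using only these review excerpts:\n- " + "\n- ".join(retrieved) + "\n\n" + query
print(prompt)  # this prompt would then be passed to the LLM (LLaMA 2 13B in the paper)
```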


III. RELATED WORKS

Evaluating Large Language Models for Document-grounded Response Generation in Information-Seeking Dialogues [4] explores the application of large language models (LLMs). This work examines how well LLMs such as ChatGPT produce responses grounded in documents during information-seeking conversations. The authors perform a human evaluation in which annotators assess the shared task-winning system's output, after finding that automatic evaluation measures are unable to adequately capture the subtleties of these verbose responses.

Benchmarking Large Language Models for News Summarization [5] discusses the effectiveness of LLMs in summarization. The main point is that instruction tuning, rather than model size, is the key to LLMs' success in zero-shot summarization; the state-of-the-art LLM performs on par with summaries written by freelance writers, with instruction tuning being the key factor for success. Assessments are conducted on the established CNN/DM and XSUM datasets. However, challenges arose due to the presence of low-quality reference summaries in these datasets: the reference summaries, as judged by human annotators, were inferior to the outputs generated by most automated systems. This inferior quality of references hampers the correlation between metric results and human judgment when computing automatic metrics. It not only complicates the evaluation process but also diminishes the performance of systems that receive supervision through methods such as fine-tuning or few-shot prompting, making meaningful comparisons challenging. To tackle the issues associated with the quality of reference summaries and gain a deeper understanding of how LLMs compare to human summary writers, the study enlisted freelance writers from Upwork, who were tasked with re-annotating 100 articles from the test sets of CNN/DM and XSUM [25]. Upon comparing the performance of the most proficient LLM, OpenAI's Instruct-Davinci, with that of the freelance writers, a notable distinction emerged: the Instruct-Davinci summaries were significantly more extractive. Through manual annotation of the summarization operations employed in these summaries, it is observed that Instruct-Davinci paraphrases less frequently, although it demonstrates the ability to coherently combine copied segments.

Google Bard [6] uses retrieval augmented generation to give context-specific answers with sources. Results show promise for accelerating knowledge expression using AI in literature reviews, with a brief comparison to OpenAI ChatGPT. Despite higher plagiarism rates in paraphrased texts, the study suggests increasing use of AI tools in academic literature.

On Faithfulness and Factuality in Abstractive Summarization [12]. Conventional likelihood training and approximate decoding objectives implemented in neural text generation models result in responses that deviate from human-like qualities, particularly in open-ended tasks like language modeling and story generation. This investigation delves into the limitations of these models when applied to abstractive document summarization. The findings reveal a high susceptibility of these models to generate content that lacks fidelity to the input document, a phenomenon commonly referred to as hallucination.

To comprehensively examine this issue, an extensive human evaluation involving multiple neural abstractive summarization systems is conducted. The objective was to gain insights into the nature of the hallucinations produced by these systems. The outcomes, as judged by human annotators, indicate the presence of significant amounts of hallucinated content across all summaries generated by the models under consideration.

This analysis presents evidence supporting the notion that pretrained models exhibit superior summarization capabilities, not only in terms of conventional metrics like ROUGE [26] but also in their proficiency to generate faithful and factually accurate summaries, as discerned through human evaluation. Additionally, it is observed that textual entailment measures exhibit a stronger correlation with faithfulness than standard metrics. This discovery holds promise for the development of automatic evaluation metrics, as well as refined training and decoding criteria, in the domain of abstractive document summarization. NLU [20]-driven pretraining generates factual information, but it is not sufficient; semantic inference-based automatic measures are better representations of true summarization quality.

BERTScore: Evaluating Text Generation with BERT [10] presents BERTScore, an automatic evaluation metric for generated text. Unlike traditional metrics, it calculates token similarity using contextual embeddings rather than exact matches between candidate and reference sentences. BERTScore exhibits significantly higher performance compared to the other metrics.

Understanding the Extent to which Content Quality Metrics Measure the Information Quality of Summaries [13]. By comparing a summary to a reference, reference-based metrics like ROUGE or BERTScore are frequently used to assess the content quality of a summary. The fundamental idea is to use the degree of commonality to gauge the information quality of summaries. Nonetheless, in the context of summary comparisons, this study critically investigates the token alignments used by ROUGE and BERTScore. The analysis casts doubt on the idea that the scores produced by these metrics invariably represent information overlap; rather, it is more appropriate to interpret them as measures of how closely related topics are covered in the summaries.


Fig. 1. Document parsing

Furthermore, there is strong evidence to suggest that this observation holds true for a wide range of other evaluation metrics employed in summarization. This finding is important because it casts doubt on how well-aligned popular summarization evaluation metrics are with the main objective of the research, which is to produce summaries that contain high-quality information. However, the study highlights a promising direction for future work: the recently proposed QAEval assessment metric, which uses question answering to score summaries, seems to capture information quality better than the evaluation methodologies currently in use. This realization points to a possible direction for improving the way the research community assesses summarization methods. To align two summaries, QAEval generates all possible wh-questions from the reference summary, looks for matching answers in the candidate summary, and calculates the final score based on the total weighted alignment.

Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [14]. Question Answering (QA) models demonstrate the capacity to produce answers to questions in two scenarios: an open-book scenario where pertinent data is also retrieved, or a closed-book scenario where they rely only on their model parameters. We still do not fully understand the mechanisms behind the success of generative QA models, despite their ability to handle complex questions. Through a number of experiments, this study seeks to explore the multi-hop reasoning capabilities of generative QA models.

First, the research breaks down multi-hop questions into separate single-hop questions and finds significant discrepancies in the responses given by QA models to questions that appear to be the same. This observation calls for a more thorough investigation of these models' underlying reasoning abilities, particularly in the context of multi-hop structures.

Moreover, the analysis reveals a weakness in these models' zero-shot multi-hop reasoning capacity. Exclusive training on single-hop questions results in suboptimal generalization to multi-hop questions. This emphasizes how crucial it is to enhance the models' ability to reason in multiple hops without explicit training on these kinds of situations.

The study offers evidence indicating that generative QA models can be made more capable of zero-shot multi-hop reasoning in two different ways: in the first, training is done on concatenations of single-hop questions, and in the second, training is done on logical forms (SPARQL) that approximate real multi-hop natural language (NL) questions. Together, these results show that multi-hop reasoning in generative QA models can be encouraged by improving model architectures or training strategies rather than being innate.

Learning to summarize from human feedback [8]. The data and metrics utilized for certain tasks limit the effectiveness of language models' training and evaluation techniques as the models get more complex. For example, summarization models are usually assessed using criteria such as ROUGE and trained to anticipate human-generated reference summaries. These measures, however, are not perfect gauges and might not fully represent the intended notion of summary quality. This study shows that training a model to optimize for human preferences can lead to notable gains in summary quality.

In order to do this, a sizable, carefully selected dataset of human comparisons between summaries is used. Next, a model is taught to forecast the summary that is preferred by humans. This trained model then serves as a reward function, enabling reinforcement learning to be used to fine-tune a summarization policy. Positive outcomes are obtained when this method is applied to an altered version of Reddit's TL;DR dataset [23]. Comparisons show that the models created with this technique perform better than larger models refined exclusively by supervised learning, as well as better than human reference summaries. Additionally, these models work well on CNN/DM [22] news items, producing summaries on par with human references without the need for extra fine-tuning for news-specific content.

The fine-tuned models and the human feedback dataset are understood through extensive analyses. According to human evaluations, the study demonstrates that the reward model generalizes to new datasets and that optimizing this reward model produces better summaries than optimizing for ROUGE. This emphasizes how crucial it is for machine learning researchers to carefully analyze the metrics they use to evaluate their models' performance, in addition to how the models are trained and assessed.


Large Language Models are Zero-Shot Reasoners [15]. This paper discusses the surprising zero-shot reasoning capabilities of pretrained large language models (LLMs). While LLMs are known for their few-shot learning abilities, the authors show that by adding a specific prompt, "Let's think step by step," LLMs significantly outperform existing zero-shot LLM performance. "Let's think step by step," followed by the remaining prompt, acts as a single zero-shot prompt, expanding the ability to elicit reasoned answers. The study introduces Zero-shot-CoT, a novel approach employing zero-shot templates for chain-of-thought reasoning prompts. It distinguishes itself from the initial chain-of-thought prompting method proposed by Wei et al. (2022) [24] by eliminating the necessity for step-by-step few-shot examples. It also diverges from the majority of earlier template prompting methodologies by being task-agnostic and facilitating multi-hop reasoning across a diverse array of tasks through the use of a single template. The fundamental concept underlying this method is characterized by its simplicity.

A Transformer-based Approach for Source Code Summarization [16]. This paper addresses source code summarization, the task of generating human-readable summaries for computer programs. The study focuses on learning code representations using Transformer models, leveraging self-attention to capture long-range dependencies between code tokens. The base model outperforms the baselines (except for ROUGE-L in Java), while the full model improves performance further.

Evaluating large language models on medical evidence summarization [17]. This paper assesses large language models, specifically GPT-3.5 and ChatGPT [21], on zero-shot medical evidence summarization across various clinical domains. Findings show that longer input text actually negatively impacts ChatGPT's ability to identify and extract the most pertinent information.

Large Language Models are Diverse Role-Players for Summarization Evaluation [18]. This paper addresses the challenge of evaluating text summarization quality and proposes a new evaluation framework based on Large Language Models (LLMs). The framework assesses both objective aspects like correctness and subjective dimensions like comprehensiveness and interestingness. It uses roleplayer prompting and context-based mechanisms to provide a comprehensive evaluation. Currently, there exists a disparity between automated metrics for text generation and human evaluation: automated metrics typically assess only surface similarities (at the lexical or semantic level), a limitation that can result in a biased perception and evaluation of the text generation capabilities of LLMs. The proposed measurement framework therefore emphasizes the generation of diversified roleplayers and an evaluation based on those roleplayers, with a context-based prompting mechanism able to generate dynamic roleplayer profiles from the input context.

Fig. 2. Flowchart for Sub-Approach 1

IV. METHODOLOGY

The goal is a methodology for summarizing game reviews over a constantly updated knowledge base. Here the traditional fine-tuning approach runs into a massive hurdle: constantly updating data takes a lot of time and resources, so we cannot feasibly fine-tune an LLM every time a detail changes.

A. Key components

LLM: The open-source large language model LLaMA V2 13B [9] serves as the generation component, as it is relatively small and fast for our computational setup due to its modest memory footprint.

Vector DB: Weaviate stores the vector stores generated from our documents [2]. Using a vector DB eliminates the need to regenerate indexes of the knowledge base on every run and saves time and resources.

Web Scraper: A Python web scraper collects updated game reviews from various public websites.

Data Framework: Python along with LlamaIndex seamlessly feeds document text-based data into the LLM (Fig. 2).

B. Document parsing

For the documents to be usable for RAG, they must be parsed and converted to vector stores, because mathematical operations are applied to them. Documents are first split into chunks of 512 tokens. These chunks are then converted into mathematical representations, or embeddings, which are stored as vector stores. Embeddings are necessary to enable top-k semantic search, in which the top-k most relevant chunks are filtered and fed into the LLM context for generation.

Upon receiving a query, a "query engine" [2] is created using the vector store of a particular game, which contains the game-review embeddings for that specific game. The query engine then retrieves the top-k most relevant chunks in the "Retrieval" step (Fig. 2). These chunks are fed into the LLM context for game review generation: with this new context-specific information, the LLM is prompted to generate a game review (Fig. 1). A condensed sketch of this pipeline is shown below.


V. EXPERIMENTS AND RESULTS

ROUGE and BERTScore were calculated on generated game reviews for 10 different games, based on a knowledge base of 120+ human-written game reviews. The following results were obtained after generating game reviews using LLaMA 2 13B with top-k RAG. A sketch of how these metrics can be computed is given after the result inferences below.

TABLE I. RESULT ANALYSIS METRICS AND SCORES

Metric              Score
ROUGE-1 Precision   0.019032
ROUGE-1 Recall      0.877474
ROUGE-2 Precision   0.005211
ROUGE-2 Recall      0.519979
ROUGE-L Precision   0.018611
ROUGE-L Recall      0.857376
BERTScore           0.813583

TABLE II. F1-SCORE COMPARISON TO MODELS WITHOUT RAG [27]

Model                     R-1     R-2     R-L      B-S
GPT-3.5                   0.3071  0.0620  0.1861   0.6102
GPT-4                     0.3211  0.0611  0.1841   0.6152
PaLM-2                    0.2201  0.0420  0.1321   0.5123
LLaMA-2-13b               0.2864  0.0639  0.1829   0.5532
LLaMA-2-7b                0.2965  0.0637  0.1741   0.5766
LLaMA-2-13b (top-3 RAG)   0.0186  0.0051  0.01821  0.8135

A. Inference from results

Low ROUGE precision - In this case, a low ROUGE precision indicates that the generated review is significantly shorter than the reviews provided for validation. It also means only certain parts of the validation review were taken into account while generating a game review.

High ROUGE recall - A high ROUGE recall in this case means that, on average, a large portion of the information present in the reference reviews is captured in the generated review.

High BERTScore - BERTScore [10] is a recent metric that considers contextual embeddings while evaluating text. In this case, it means that the reference review and the generated review are highly similar in terms of context.
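For reference, the following is a small sketch of how the scores in Table I could be computed with the commonly used rouge-score and bert-score Python packages; the two review strings are placeholders, not data from the study.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = "A short, atmospheric RPG praised for its combat but criticized for performance."  # placeholder
reference = "Critics highlight the rewarding combat and art direction, but note frame drops."  # placeholder

# ROUGE-1, ROUGE-2 and ROUGE-L precision/recall/F1, as reported in Tables I and II.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(reference, generated).items():
    print(f"{name}: P={result.precision:.4f} R={result.recall:.4f} F1={result.fmeasure:.4f}")

# BERTScore F1 computed over contextual embeddings.
P, R, F1 = bert_score([generated], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```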
VI. DISCUSSION

In observation, the game review generator suffers from a small context window. Due to this size, it is only able to fit in the 2 or 3 most relevant game-review chunks (top-k where k=3), where each chunk is 512 tokens long. The quality of the game reviews and the number of game-review chunks could be further increased if this context limit were somehow bypassed [11].

A. Hardware limitations

This experiment was performed on limited hardware (an RTX 3050 4GB laptop GPU). The above-specified context limit could also have been mitigated by using a GPU with more VRAM, which would allow the use of a language model with a larger context window.

B. Unbounded Newline Outburst

After the LLM has finished generating a game review, we observed that it sometimes continues to return empty lines indefinitely. We call this an Unbounded Newline Outburst, where the LLM does not stop returning new lines. Further research is required to reach the root of this issue.

A more complex RAG technique than top-k similarity RAG may be used to improve the coverage of game reviews.

Fig. 3. Result analysis

VII. CONCLUSION

In summary, context-specific and comprehensible game reviews were successfully generated using human-written game reviews stored in a vector database. The generated game reviews are readable and well formulated, and the information within a generated review lies well within what was provided in the human-written documents.

It was observed that LLMs with RAG can be effectively used for game review generation and summarization tasks without incurring any additional fine-tuning costs. However, RAG may not take all information into consideration and is limited by the context length of the underlying LLM.

REFERENCES

[1] Zhang, L., Li, Y., Liu, Z., Liu, J., & Yang, M. (2023). Marathon: A Race Through the Realm of Long Context with Large Language Models. arXiv preprint arXiv:2312.09542.
[2] Schumacher, D. (2023). V3CTRON — Data Retrieval & Access System For Flexible Semantic Search & Retrieval Of Proprietary Document Collections Using Natural Language Queries. Available at SSRN.
[3] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
[4] Braunschweiler, N., Doddipatla, R., Keizer, S., & Stoyanchev, S. (2023). Evaluating Large Language Models for Document-grounded Response Generation in Information-Seeking Dialogues. arXiv preprint arXiv:2309.11838.


[5] Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. B. (2023). Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848.
[6] Aydin, Ö. (2023). Google Bard generated literature review: metaverse. Journal of AI, 7(1), 1-14.
[7] Ranjit, M., Ganapathy, G., Manuel, R., & Ganu, T. (2023). Retrieval augmented chest x-ray report generation using OpenAI GPT models. arXiv preprint arXiv:2305.03660.
[8] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., ... & Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, 3008-3021.
[9] Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., ... & Gadepally, V. (2023, September). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC) (pp. 1-9). IEEE.
[10] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
[11] Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560.
[12] Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661.
[13] Deutsch, D., & Roth, D. (2021, November). Understanding the extent to which content quality metrics measure the information quality of summaries. In Proceedings of the 25th Conference on Computational Natural Language Learning (pp. 300-309).
[14] Jiang, Z., Araki, J., Ding, H., & Neubig, G. (2022). Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering. arXiv preprint arXiv:2210.04234.
[15] Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35, 22199-22213.
[16] Ahmad, W. U., Chakraborty, S., Ray, B., & Chang, K. W. (2020). A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653.
[17] Tang, L., Sun, Z., Idnay, B., Nestor, J. G., Soroush, A., Elias, P. A., ... & Peng, Y. (2023). Evaluating large language models on medical evidence summarization. npj Digital Medicine, 6(1), 158.
[18] Wu, N., Gong, M., Shou, L., Liang, S., & Jiang, D. (2023). Large language models are diverse role-players for summarization evaluation. arXiv preprint arXiv:2303.15078.
[19] Pérez, J., Arenas, M., & Gutierrez, C. (2009). Semantics and complexity of SPARQL. ACM Transactions on Database Systems (TODS), 34(3), 1-45.
[20] McShane, M. (2017). Natural language understanding (NLU, not NLP) in cognitive systems. AI Magazine, 38(4), 43-56.
[21] Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., ... & Ge, B. (2023). Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, 100017.
[22] Saito, I., Nishida, K., Nishida, K., & Tomita, J. (2020). Abstractive summarization with combination of pre-trained sequence-to-sequence and saliency models. arXiv preprint arXiv:2003.13028.
[23] Völske, M., Potthast, M., Syed, S., & Stein, B. (2017, September). TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization (pp. 59-63).
[24] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., ... & Fiedel, N. (2023). PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1-113.
[25] Qi, W., Gong, Y., Jiao, J., Yan, Y., Chen, W., Liu, D., & Duan, N. (2021, July). BANG: Bridging autoregressive and non-autoregressive generation with large scale pretraining. In International Conference on Machine Learning (pp. 8630-8639). PMLR.
[26] Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74-81).
[27] Laskar, M. T. R., Fu, X. Y., Chen, C., & Tn, S. B. (2023). Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective. arXiv preprint arXiv:2310.19233.
