Evaluating Top-K RAG-based Approach For Game Review Generation
Pratyush Chauhan, Rahul Kumar Sahani, Soham Datta
Department of Computer Science, Bennett University, Greater Noida, India
pratyushchauhan62@gmail.com, rahul.sahan810@gmail.com, sohamdatta34@gmail.com
Furthermore, there is strong evidence to suggest that this observation holds true for a wide range of other evaluation metrics employed in summarization. This finding is important because it casts doubt on how well-aligned popular summarization evaluation metrics are with the main objective of the research, which is to produce summaries that contain high-quality information. However, the study highlights a promising direction for future work: the recently proposed QAEval assessment metric, which uses question answering to score summaries, appears to capture information quality better than the evaluation methodologies currently in use. This realization points to a possible direction for improving the way the research community assesses summarization methods. To align two summaries, QAEval generates all possible wh-questions from the reference summary, looks for matching answers in the candidate summary, and calculates the final score from the total weighted alignment.
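The sketch below illustrates the core idea of such QA-based scoring; it is a simplified stand-in, not the QAEval implementation. Question generation is assumed to have happened upstream, and the extractive QA checkpoint and token-F1 weighting are illustrative choices.

```python
# Minimal sketch of QA-based summary scoring in the spirit of QAEval;
# reference-derived (question, answer) pairs are answered against the
# candidate summary and the answer overlap is averaged.
from collections import Counter

from transformers import pipeline

# Any extractive QA checkpoint would do; this one is only an example.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 overlap between a predicted and a gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def qa_score(candidate: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Answer each reference question against the candidate summary and
    average the answer overlap, approximating the weighted alignment."""
    scores = [
        token_f1(qa(question=q, context=candidate)["answer"], a)
        for q, a in qa_pairs
    ]
    return sum(scores) / len(scores) if scores else 0.0
```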
Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [14]. Question Answering (QA) models can produce answers in two scenarios: an open-book scenario, in which pertinent data is also retrieved, or a closed-book scenario, in which they rely only on their model parameters. The mechanisms behind the success of generative QA models are still not fully understood, despite their ability to handle complex questions. Through a number of experiments, this study explores the multi-hop reasoning capabilities of generative QA models.
First, the research breaks down multi-hop questions into separate single-hop questions and finds significant discrepancies in the responses QA models give to questions that appear to be the same. This observation calls for a more thorough investigation of these models' underlying reasoning abilities, particularly in the context of multi-hop structures.
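As a concrete illustration of this decomposition probe, the direct multi-hop answer can be compared against the answer obtained by chaining the single-hop questions. The checkpoint, prompts, and example question below are illustrative assumptions, not the paper's exact setup.

```python
from transformers import pipeline

# Stand-in generative QA model; the paper evaluates its own set of models.
qa_model = pipeline("text2text-generation", model="google/flan-t5-base")

def answer(question: str) -> str:
    return qa_model(question, max_new_tokens=16)[0]["generated_text"].strip()

def chained_answer(hop1: str, hop2_template: str) -> str:
    bridge = answer(hop1)                        # hop 1: find the bridge entity
    return answer(hop2_template.format(bridge))  # hop 2: ask about that entity

direct = answer("Where was the director of Inception born?")
chained = chained_answer("Who directed Inception?", "Where was {} born?")
# The discrepancy the study reports: `direct` and `chained` frequently
# disagree, even though both formulations ask for the same fact.
```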
Moreover, the analysis reveals a weakness in these models' zero-shot multi-hop reasoning capacity: exclusive training on single-hop questions yields suboptimal generalization to multi-hop questions. This emphasizes how crucial it is to enhance the models' ability to reason over multiple hops without explicit training on such cases.
Finally, the study offers evidence that generative QA models can be made more capable of zero-shot multi-hop reasoning in two different ways: in the first, training is done on concatenations of single-hop questions; in the second, on logical forms (SPARQL) that approximate real multi-hop natural language (NL) questions. Together, these results show that multi-hop reasoning in generative QA models is not innate but can be encouraged through improved model architectures or training strategies.
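A minimal sketch of what these two data recipes could look like as (input, target) training pairs follows; the field layout and the SPARQL surface form are illustrative assumptions rather than the paper's exact format.

```python
# (a) Concatenate the single-hop questions that compose a multi-hop
#     question and train the model to emit the final answer.
def concat_example(single_hops: list[str], final_answer: str) -> tuple[str, str]:
    return " ".join(single_hops), final_answer

# (b) Train on a logical form that approximates the multi-hop NL question.
def sparql_example(logical_form: str, final_answer: str) -> tuple[str, str]:
    return logical_form, final_answer

pair_a = concat_example(
    ["Who directed Inception?", "Where was that person born?"], "London"
)
pair_b = sparql_example(
    "SELECT ?x WHERE { ?d :directed :Inception . ?d :bornIn ?x }", "London"
)
```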
Learning to summarize from human feedback [8]. The data and metrics used for a task limit how effectively language models can be trained and evaluated as the models grow more complex. For example, summarization models are usually assessed with criteria such as ROUGE and trained to predict human-written reference summaries. These measures, however, are imperfect gauges and might not fully capture what is actually meant by summary quality. This study shows that training a model to optimize for human preferences can lead to notable gains in summary quality.
To do this, a sizable, carefully curated dataset of human comparisons between summaries is collected, and a model is trained to predict which summary humans prefer. This trained model then serves as a reward function, enabling reinforcement learning to be used to fine-tune a summarization policy. Positive outcomes are obtained when the method is applied to a filtered version of Reddit's TL;DR dataset [23]: the resulting models outperform both much larger models fine-tuned purely with supervised learning and the human reference summaries. Additionally, these models transfer well to CNN/DM [22] news articles, producing summaries on par with the human references without any extra fine-tuning for news content.
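The heart of this pipeline is the pairwise reward-model objective. The sketch below shows its usual Bradley-Terry form, along with the KL-penalized reward typically used during the RL stage; `reward_model`, the tensor shapes, and the coefficient value are assumptions for illustration.

```python
import torch.nn.functional as F

def comparison_loss(reward_model, post, preferred, rejected):
    """Push the reward of the human-preferred summary above the rejected
    one: loss = -log sigmoid(r(post, preferred) - r(post, rejected))."""
    r_pref = reward_model(post, preferred)  # scalar reward per example, (batch,)
    r_rej = reward_model(post, rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()

def rl_reward(r, logp_policy, logp_sft, beta=0.05):
    """Reward used during RL fine-tuning: the learned reward minus a KL
    penalty keeping the policy near the supervised baseline (beta is an
    illustrative coefficient, not the paper's value)."""
    return r - beta * (logp_policy - logp_sft)
```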
The fine-tuned models and the human feedback dataset are examined through extensive analyses. According to human evaluations, the reward model generalizes to new datasets, and optimizing this reward model produces better summaries than optimizing for ROUGE. This emphasizes how carefully machine learning researchers must choose the metrics used to evaluate their models, in addition to how the models are trained and assessed.
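For contrast, the ROUGE baseline referred to above can be computed with the rouge-score package; this is a generic usage example, not the paper's evaluation code.

```python
from rouge_score import rouge_scorer

# Reference-overlap scoring that, per the study, is a weaker optimization
# target than the learned reward model.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat sat on the mat",             # reference summary
    prediction="a cat was sitting on the mat",   # candidate summary
)
print(scores["rougeL"].fmeasure)
```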
[5] Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. B. (2023). Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848.
[6] Aydın, Ö. (2023). Google Bard generated literature review: metaverse. Journal of AI, 7(1), 1-14.
[7] Ranjit, M., Ganapathy, G., Manuel, R., & Ganu, T. (2023).
Retrieval augmented chest x-ray report generation using openai gpt
models. arXiv preprint arXiv:2305.03660.
[8] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., ... & Christiano, P. F. (2020). Learning to summarize from human feedback. Advances in Neural Information Processing Systems, 33, 3008-3021.
[9] Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., ... & Gadepally, V. (2023, September). From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC) (pp. 1-9). IEEE.
[10] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019).
Bertscore: Evaluating text generation with bert. arXiv preprint
arXiv:1904.09675.
[11] Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560.
[12] Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). On
faithfulness and factuality in abstractive summarization. arXiv preprint
arXiv:2005.00661.
[13] Deutsch, D., & Roth, D. (2021, November). Understanding the extent to which content quality metrics measure the information quality of summaries. In Proceedings of the 25th Conference on Computational Natural Language Learning (pp. 300-309).
[14] Jiang, Z., Araki, J., Ding, H., & Neubig, G. (2022). Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering. arXiv preprint arXiv:2210.04234.
[15] Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35, 22199-22213.
[16] Ahmad, W. U., Chakraborty, S., Ray, B., & Chang, K. W. (2020). A
transformer-based approach for source code summarization. arXiv
preprint arXiv:2005.00653.
[17] Tang, L., Sun, Z., Idnay, B., Nestor, J. G., Soroush, A., Elias, P. A.,
... & Peng, Y. (2023). Evaluating large language models on medical
evidence summarization. npj Digital Medicine, 6(1), 158.
[18] Wu, N., Gong, M., Shou, L., Liang, S., & Jiang, D. (2023). Large language models are diverse role-players for summarization evaluation. arXiv preprint arXiv:2303.15078.
[19] Pérez, J., Arenas, M., & Gutierrez, C. (2009). Semantics and complexity
of SPARQL. ACM Transactions on Database Systems (TODS), 34(3),
1-45.
[20] McShane, M. (2017). Natural language understanding (NLU, not NLP)
in cognitive systems. AI Magazine, 38(4), 43-56.
[21] Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., ... & Ge, B. (2023). Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, 100017.
[22] Saito, I., Nishida, K., Nishida, K., & Tomita, J. (2020). Abstractive
summarization with combination of pre-trained sequence-to-sequence
and saliency models. arXiv preprint arXiv:2003.13028.
[23] Völske, M., Potthast, M., Syed, S., & Stein, B. (2017, September). TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization (pp. 59-63).
[24] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts,
A., ... & Fiedel, N. (2023). Palm: Scaling language modeling with
pathways. Journal of Machine Learning Research, 24(240), 1-113.
[25] Qi, W., Gong, Y., Jiao, J., Yan, Y., Chen, W., Liu, D., & Duan, N. (2021, July). Bang: Bridging autoregressive and non-autoregressive generation with large scale pretraining. In International Conference on Machine Learning (pp. 8630-8639). PMLR.
[26] Lin, C. Y. (2004, July). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74-81).
[27] Laskar, M. T. R., Fu, X. Y., Chen, C., & Tn, S. B. (2023). Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective. arXiv preprint arXiv:2310.19233.