
Applications of ChatGPT & Co in scientific research:

systematic search and comparison of tools


Mohamed Nejjar

Technical University of Munich


mohamed.nejjar@tum.de

Abstract. In the rapidly evolving domain of artificial intelligence, generative language models have
emerged as powerful tools for various applications. This study critically examines the efficacy of nine AI
tools - ChatGPT-3.5, ChatGPT-4, Google Bard, BingAI, YouChat, QuillBot, DeepL (Write Beta),
Graphmaker.ai, and Github Copilot - in facilitating or even conducting scientific research. The tools were
assessed across a range of tasks including text enhancement, text summarization, the retrieval of relevant
references, and assistance with developing presentations. Additionally, they were evaluated in code
generation, data visualization, and data analysis. The evaluation criteria encompassed precision,
completeness, context understanding, creativity, adaptability, and reliability. The findings reveal that while
some tools excel in specific tasks and some are found to be underwhelming in others, ChatGPT-4 emerges
as an all-rounder with consistently satisfactory results. However, AI tools should be regarded as research
assistants and sources of inspiration rather than replacements for human intellect, as they are susceptible to
errors and prone to occasional inaccuracies, including hallucinations. The study concludes that AI tools,
particularly generative language models, harbor the capacity for significant time economization and
efficiency enhancement in scientific research.

Keywords: Generative Language Models, AI Tools in Scientific Research, Comparative Evaluation,


Research Productivity Enhancement, Human-AI Collaboration.

1 Introduction

In the contemporary landscape of scientific research, the quest for efficiency and innovation is ceaseless. The
burgeoning field of Artificial Intelligence (AI) is indispensable in this pursuit. Among the myriad of AI
technologies, generative language models such as ChatGPT [7, 8] have garnered particular attention for their
versatility and prowess in natural language processing tasks. This paper delves into the applications of ChatGPT and
its counterparts in scientific research, with a focus on systematically comparing and evaluating their capabilities
according to a selected set of criteria.

To fully appreciate the significance and potential of these AI tools, it is essential to understand the historical context
and foundational concepts that underpin them. Generative language models are a subset of AI that specializes in
generating text. They are built upon neural networks, which are computational models inspired by the human brain.
These networks are trained on vast datasets, enabling them to generate human-like text. The development of
generative models has been marked by a series of advancements, with the introduction of the Transformer
architecture being a pivotal moment. The Transformer architecture [3], which underlies models like GPT,
revolutionized natural language processing with its ability to process words in parallel and focus on different parts of
a sentence, irrespective of their position.

The landscape of AI tools is vast and varied, ranging from generative language models to data visualization tools.
The common thread that binds these tools is their potential to revolutionize the way scientific research is conducted.

However, with the plethora of tools available, it is imperative to discern which tools are best suited for specific tasks
and to what extent they can augment the research process.

Historically, scientific research has been a labor-intensive process, often requiring extensive literature reviews, data
analysis, and content generation. The advent of AI tools promises to alleviate some of these burdens. However,
while they have shown promise, the extent of their efficacy across a broader range of tasks remains an open
question. Previous studies have explored the use of AI in specific aspects of scientific research such as data analysis
and literature review. For instance, AI-based tools [1] like Semantic Scholar and Elicit have been developed to assist
in literature reviews by automating parts of the process and providing access to a vast database of scientific papers.
However, there is a paucity of comprehensive evaluations that compare a wide range of AI tools across multiple
dimensions including text enhancement, code generation, and data visualization. This paper aims to bridge this
knowledge gap by providing a comprehensive evaluation of nine AI tools, namely ChatGPT-3.5 [7], ChatGPT-4
[8], Google Bard [9], BingAI [10], YouChat [11], QuillBot [15], DeepL (Write Beta) [12], Graphmaker.ai [13],
and Github Copilot [14]. The AI tools selected for this study represent a cross-section of the most advanced and
widely used solutions in the field. The evaluation criteria were chosen to reflect the key attributes that are essential
for tools to be effective in scientific research, including their ability to produce relevant, complete, and reliable
outputs.

In conclusion, the primary objective of this paper is to critically assess the efficacy of various AI tools in facilitating
scientific research. By doing so, it aims to furnish the reader with a holistic understanding of the state of AI in
scientific research. The findings of this study may offer insights that could be valuable for future research and
development in AI, paving the way for more sophisticated and integrated AI tools that can further catalyze
innovation and efficiency in scientific research. However, it is important to approach the results with caution and
consider the limitations of the current AI tools – since they can also generate unwanted and irrelevant answers just
like they can provide sensible and precise information – as well as the rapidly evolving nature of the field.

In line with the principles of open science, this study has been conducted to ensure transparency and reproducibility.
The prompts and answers provided in the appendix allow an exact reproduction of the methodology used in this
paper.

The remainder of this paper is structured as follows: Section 2 provides an in-depth background on generative
language models and the AI tools evaluated. Section 3 details the methodology employed for evaluating the AI tools
and presents the results and findings of the evaluation. Section 4 discusses the implications of the findings. Section 5
provides an overview of the tools used as well as a rating system based on the evaluation criteria and finally, Section
6 concludes the paper and offers directions for future research. Supplementary material and additional data can be
found in the Appendix.

2 Background
2.1 Introduction to Artificial Intelligence

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to
emulate human-like thinking and learning. It is the science and engineering of making intelligent machines,
especially intelligent computer programs. It is related to the similar task of using computers to understand human
intelligence, but AI does not have to confine itself to methods that are biologically observable. [4]

2.2 Evolution of natural language processing

Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of
artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in
much the same way human beings can. NLP combines computational linguistics—rule-based modeling of human
language—with statistical, machine learning, and deep learning models. Together, these technologies enable

computers to process human language in the form of text or voice data and to ‘understand’ its full meaning,
complete with the speaker or writer’s intent and sentiment. [5]

One of the most significant advancements in NLP is the development of the Transformer architecture. It forms the
basis of models like GPT and other generative language models. [3]
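For illustration, the following is a minimal NumPy sketch of the scaled dot-product attention operation at the core of this architecture, which lets the model weigh every position of a sequence against every other position in parallel; the dimensions and random inputs are purely illustrative and are not taken from any of the evaluated tools.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) matrices; all positions are processed in parallel.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how strongly each position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key positions
    return weights @ V                                # weighted combination of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # toy example: 4 token positions, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8): one output vector per position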

A generative model can take what it has learned from the examples it’s been shown and create something entirely
new based on that information. Hence the word “generative!” Large language models (LLMs) are one type of
generative AI since they generate novel combinations of text in the form of natural-sounding language. And we can
even build language models to generate other types of outputs, such as new images, audio and even video, like with
Imagen, AudioLM and Phenaki. [6]

2.3 Overview of AI Tools in Scientific Research

AI tools have been increasingly used in scientific research for various purposes, including data analysis, literature
review, and the retrieval of references [1]. Some even show great potential in facilitating scientific research. Certain
tools, such as DeepL Write and Github Copilot, are designed for specific tasks (text improvement and code
completion, respectively), while others, like ChatGPT and Bing, are more versatile.

3 Methodology
3.1 Overview of the Methodology

This section delineates the methodology employed for the comparison of AI tools. The aim is to assess the
capabilities of these tools and evaluate their potential to assist researchers in various tasks. The tools will generate
answers to a broad range of queries and will be assessed using a list of chosen criteria. The assessment will then be
presented in the form of radar graphs.

3.2 Tool Selection

The comparison involved nine distinct AI tools: ChatGPT-3.5, ChatGPT-4, Google Bard, BingAI, YouChat,
QuillBot, DeepL (Write Beta), Graphmaker.ai, and Github Copilot. These tools were selected for being some of the
most advanced and widely used solutions in this field and area of research. This selection includes generative
language models such as ChatGPT and Google Bard as well as some more specific tools like DeepL and Quillbot.

3.3 Task selection

The tasks employed to assess the capabilities of the AI tools are text enhancement, text summarization, the retrieval
of relevant references, and assistance with developing presentations (as a source of writing inspiration). Additionally,
the tools were evaluated on code generation, data visualization, and data analysis. These tasks were chosen for their
breadth and relevance in examining the depth and range of capabilities of these AI tools in scientific research.

The exact tasks are as follows:

Code generation: Matrix multiplication using multi-threading. This example is simple yet interesting since it
touches on a complex aspect of concurrent programming.
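For reference, the following is a minimal sketch of the kind of program the tools were asked to produce, written here in Python with the rows of the result split across worker threads; the tools were free to choose their own language and approach, and the function and variable names below are illustrative only.

import threading

def multiply_rows(A, B, C, rows):
    # Fill in the given rows of C = A x B.
    for i in rows:
        for j in range(len(B[0])):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))

def threaded_matmul(A, B, num_threads=4):
    # Split the rows of the result matrix across a fixed number of threads.
    # (Note: CPython's global interpreter lock limits true parallelism for CPU-bound work.)
    C = [[0] * len(B[0]) for _ in range(len(A))]
    chunks = [range(i, len(A), num_threads) for i in range(num_threads)]
    threads = [threading.Thread(target=multiply_rows, args=(A, B, C, chunk)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(threaded_matmul(A, B, num_threads=2))  # [[19, 22], [43, 50]]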

Text enhancement: Improve the following text: "Generative language models like GPT are changing the way
scientists do research. These models are a type of artificial intelligence that can generate text. They are trained on a
large amount of text data and can generate new text that is similar to the text they were trained on. This is
particularly useful in research because it can help scientists come up with new ideas. For example, if a scientist is
researching a rare disease, they can use a language model to search through thousands of research papers and find
information that might be helpful. This can save a great deal of time. Because these models are so good at
generating text, they can also be used to write articles or reports. This means that scientists can spend more time
doing experiments and less time writing. Some people think that in the future these models could even come up with
new theories or make discoveries on their own.”

Text summarization: Summarize the overview of a paper about Microsoft’s DocumentAI (Intelligent Document
Processing) [2]

The retrieval of relevant references: Retrieve relevant papers about people who have used ChatGPT or any of the
used AI tools in their research.

Assistance with developing presentations: After going over the contents of the paper, the methodology used, and the
results in detail, I asked the tools to come up with a good structure that I could use as a basis for writing the paper.

Paper questions generation: Generate questions on my paper so that I can better prepare myself for the
presentation.

Data visualization: Represent a CSV file (see Table 1) in different ways to shed light on the different interpretations
of a simple dataset.

Data analysis: Using the same CSV file (see Table 1) as in the previous task, analyze the data through the following
tasks and questions (a minimal illustrative analysis sketch is given after Table 1):

-Revenue Trends: "Generate a line chart to visualize the daily total revenue trends for all categories combined.
Analyze if there are any noticeable patterns or trends."

-Category Comparison: "Compare the total revenue generated by the Electronics category to that of the Books
category. Which category has higher revenue and by what percentage?"

-Product Performance: "Identify the product with the highest total revenue and the product with the lowest total
revenue. Provide a bar chart for visualization."

-Correlation Analysis: "Analyze the correlation between the number of units sold and the revenue. Provide a
scatter plot and discuss any trends or patterns you observe."

-Time Series Decomposition: "Perform a time series decomposition on the daily total revenue to analyze its trend,
seasonality, and residuals."

-Revenue Forecast: "Using the given data, create a simple time series model to forecast the total revenue for the
next 7 days."

-Category Contribution: "Create a pie chart to visualize the proportion of total revenue contributed by each
category (Electronics and Books)."

-Product Sales Distribution: "Generate a box plot to visualize the distribution of units sold for each product.
Analyze any outliers or differences between products."

-Day-wise Product Sales: "Create a heatmap to visualize the number of units sold for each product over time.
Analyze if certain products have higher sales on specific days."

-Revenue Efficiency: "Calculate the revenue efficiency of each product as the ratio of total revenue to units sold.
Rank the products based on their revenue efficiency."

Table 1: CSV used as a use case for the data analysis/visualization sections
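To make the tasks above more concrete, the following is a minimal pandas sketch of how two of the prompts (revenue trends and category comparison) could be approached; the file name sales.csv and the column names date, category, product, units_sold, and revenue are assumptions about the layout of Table 1, not a reproduction of it.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed CSV layout: one row per product per day with columns
# date, category, product, units_sold, revenue (illustrative names only).
df = pd.read_csv("sales.csv", parse_dates=["date"])

# Revenue trends: daily total revenue for all categories combined, as a line chart.
daily = df.groupby("date")["revenue"].sum()
daily.plot(kind="line", title="Daily total revenue")
plt.savefig("revenue_trends.png")

# Category comparison: total revenue of Electronics versus Books.
by_category = df.groupby("category")["revenue"].sum()
diff_pct = (by_category["Electronics"] - by_category["Books"]) / by_category["Books"] * 100
print(by_category)
print(f"Electronics exceeds Books by {diff_pct:.1f}%")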

3.4 Evaluation criteria

The criteria used to rate the capabilities of the AI tools are precision, completeness, context understanding,
creativity, adaptability, and reliability. Together, these criteria allow for a comprehensive evaluation of the tools,
reflecting the qualities expected of AI assistants in research.

3.5 Data Collection and Analysis

For clarity, presentation, and formatting purposes, the research results are summarized in a table (see Table 2). The
five tool slots consist of four generative language models (GPT-4, Google Bard, Bing, and YouChat), which were
used in all of the tasks, and a flexible slot that is filled either with a task-specific tool or with GPT-3.5 when it is
interesting to compare the evolution of generative language models over time. Since there is no specialized tool that
can be used in all of the categories (unlike the generative models behind ChatGPT, BingAI, YouChat, and Google
Bard), the last slot was allocated to a set of these task-specific tools.

When going over each category:

Code and Data Analysis: Github Copilot was used since it is a tool specialized in coding. It is a code autocomplete
tool that is integrated into one's IDE. Since it is convenient, it is interesting to see whether it brings more than
convenience and ease of access and whether its results are comparable with those of the other tools.

Text enhancement: As of 25.07.2023, DeepL Write is a beta feature on the DeepL website. It specializes in text
enhancement. Since DeepL is a widely used tool, it is interesting to see how it fares against the studied set of
generative language models.

References: Since it has been observed in our results and findings that generative language models lack good
judgement in choosing relevant references, it is important to observe the advancements of generative language
models in this direction to help provide an insight into the future.

Text summarization: Quillbot is a text-summarization tool that offers two modes (important ideas and summary
paragraph). Including it in our research would give us the opportunity to shed a different light on this task.

Data visualization: Graphmaker.ai is a graph-making AI tool that uses ChatGPT's API. When it is provided with a
dataset in one of several formats, it can generate graphs according to the prompt given. It was included since it is
specialized in graph-making and is more convenient to use, because generative language models (as of July 2023)
cannot generate images and graphs directly.

For the two categories "Assistance with developing presentations" and "Paper questions generation", no additional
specialized tool was used, since there are no tools specialized in these particular tasks; only the generative language
models were compared there.

The results have been simplified to bullet points so that they are more concise, transparent, and understandable.

GPT-4
- Code: Had the less efficient code at first. With an improved query it came up with the most efficient code.
- Text enhancement: Great improvements; does not miss any idea in the initial text and adapts well to query variation. Too extensive in its text enhancement: the text is too long.
- References: Although it stays on topic and does not hallucinate, the references provided are informal, which is not suited for scientific research (none are peer-reviewed or official documentation). When used without plugins (internet access) it is far more careful than GPT-3.5 about referencing non-existent or thematically irrelevant papers.
- Text summarization: Complete, does not miss any point, but too extensive: the summary is too long. Much better when asked to make the text shorter, although the result feels less smooth.
- Assistance with developing presentations: Comes up with innovative ideas and is complete in its answer. It was told to come up with ideas for every section while keeping some metrics in mind, and it was a good source of inspiration.
- Paper questions generation: Impressive questions, not only addressing the multiple layers of the paper but also other aspects, such as bias in comparing the tools, and questions that go beyond the paper.
- Data visualization: With different iterations of the query it provides at most one inaccurate graph. Provides Python code for this purpose, since it is a text-generation AI and cannot generate graphs directly. The graphs are elegant and clear, and the color scheme used is adequate and logical.
- Data analysis: menti… chain o… great… conclus… having… small d… mentio… when a… analyze… seasona… exampl… 10/10

Google Bard
- Code: Had compilation errors and infinite recursion. Does not even make use of multi-threading as it was asked to.
- Text enhancement: Noticeable improvement when enhancing the query. At first, it was just paraphrasing the original text. Still better with an improved query, just like the other models.
- References: Does not want to provide references, no matter how the query was changed. States that, as an AI model, it cannot provide scientific references.
- Text summarization: Adds unnecessary and redundant content to a summary ("Microsoft research has been leading the way in Document AI research", see Fig. 6 in the Appendix). The summary is too short; multiple elements have been ignored.
- Assistance with developing presentations: Kept the metrics I mentioned for how I want to write my paper and presented them as bullet points with implementation ideas. Not as creative as GPT-4 but still does great work.
- Paper questions generation: The questions are well crafted and address the different layers of the paper.
- Data visualization: Provided Python code as well. The code has multiple errors and most of the graphs could not be generated.
- Data analysis: Makes calcula… 5/10 questions

Bing
- Code: The code was good from the start and was made much better with an improved query.
- Text enhancement: Deletes some of the details in the text. Since it has no context here, it can delete important details unless explicitly told not to.
- References: Tags some of the references with hyperlinks that have nothing to do with them. The papers provided are too informal (everyday queries). When asked specifically about peer-reviewed papers, it provides a couple of papers; while they are not fully relevant, this is still a notable improvement compared to the other AI tools.
- Text summarization: Good summary, straight to the point. Does not go into detail on how we can go over these points.
- Assistance with developing presentations: Clear and simple but vague structure.
- Paper questions generation: Mostly generated questions on the methodology and conclusion. The questions were very generic; most just ask for one's opinion or for more explanations on a part of the methodology.
- Data visualization: Same remarks as GPT-4. The graphs are elegant; the code works perfectly, and mistakes or inaccuracies are rare.
- Data analysis: Answe… formul… 8/10 questions

YouChat
- Code: Used one of the most efficient ways to solve the problem (Fork/Join pool).
- Text enhancement: Mediocre improvement. Deletes some of the details in the text.
- References: Artificial hallucination; absurd references which have nothing to do with research.
- Text summarization: Great summary, smooth transitions, and straight to the point.
- Assistance with developing presentations: Same remarks as Google Bard, though it gives additional ideas, which earns it additional points in creativity.
- Paper questions generation: Mostly generated questions on the used methodology, the conclusion, and the view on the future. Generic questions.
- Data visualization: Provided code as well. The graphs are good, with no errors. The color scheme is not adequate (for example, two distinct categories have the same color).
- Data analysis: does n… underst… couple… 5/10 questions

Additional tool (task-specific)
- Code (Github Copilot): Makes the layout for the code. Very convenient to use. Creates non-provided classes.
- Text enhancement (DeepL Write): Great for error detection and lexical enhancement. Flexible UI: word and sentence edit modes. Lacks context recognition, which limits its potential.
- References (GPT-3.5): Used to highlight the improvement of ChatGPT over time and to compare its results to GPT-4. Artificial hallucination. Some of the references provided are geology and astrophysics papers, so we can conclude that it is factually and thematically inaccurate. Even the computer science papers are either outdated or off topic.
- Text summarization (Quillbot): This tool is specialized in text summarization. It has a very flexible interface, since you can choose between four lengths (very short/short/long/very long) and two modes (key sentences/paragraphs). The summary feels like putting together key sentences rather than a smoothly written paragraph. Limited to 600 words.
- Assistance with developing presentations: none.
- Paper questions generation: none.
- Data visualization (Graphmaker.ai): A tool specialized in data visualization, so it can directly generate graphs. The graphs are clearer and more elegant since this is the tool's domain of expertise. Since it is not a language model, it does not understand prompts as well as the other AI tools; consequently, it can only comprehend simple prompts and make uncomplicated graphs.
- Data analysis (Github Copilot): precis… code autocom… sugges… conve… it's inc… IDE… the on… that it n… guided… comple… tasks w… cannot… task tha… multipl… tasks

Table 2: Comparison of tools

4 Achieving Significance: Implications of the Findings


The evaluation findings, as previously discussed, offer pivotal insights into AI's current capabilities and limitations
within scientific research, and underscore the importance of judicious and ethical integration of these tools.

4.1 Enhanced Efficiency in Research Approaches

As observed from our comparison of tools, AI tools, notably GPT-4, Bing, and Github Copilot, have shown
significant potential in automating and enhancing various aspects of the research process. Although they are efficient
to different degrees across tasks, the general observation is that they can simplify manually demanding tasks like
coding, text manipulation, and data visualization. This implies that researchers can allocate more time to critical
thinking and analysis, rather than spending it on repetitive tasks.

A tool like Github Copilot can be of immense help to developers since it is an efficient and convenient code
autocomplete tool. Furthermore, tools like GPT-4 have shown great potential in the various aspects of text manipulation
due to their impressive context understanding, creativity, and adaptability to the different tasks and queries presented
to them.

Furthermore, generative language models, in particular GPT-4, have shown great potential as sources of inspiration,
since they can be of great help with the structure of a paper or a presentation. While GPT-4 is a great tool for
constructing such material, it is also capable of deconstructing it, by analyzing existing work and generating
questions and discussions around it.

4.2 Need for Critical Evaluation of AI Outputs

While AI tools have shown promise in facilitating research, it is crucial to approach their outputs with caution. As
discussed in our tool comparison table, these tools are prone to making mistakes and in some cases hallucinations.

One way to highlight blatant artificial hallucination is to ask for relevant references for research purposes. Tools like
YouChat provided fabricated references, while others like GPT-4 and Bing provided papers that are completely
irrelevant to the topic or simply too informal to be used in a scientific context.

This underscores the importance of human oversight and critical evaluation of the outputs generated by AI tools to
ensure their accuracy and relevance.

4.3 Unexpected Findings and Their Implications

While the study confirmed the utility of AI in research, an unexpected finding was the extent to which some of the
AI tools could generate creative outputs. This has implications for innovation in research, as AI tools could
potentially contribute novel perspectives or solutions.

GPT-4 was particularly interesting because of its creativity. It has proven to be an impressive source of inspiration
since it can come up with creative ways to solve some tasks like assistance in structuring a presentation or overall
ideas on how to explain a concept to a specific audience.

Although it shows a significant improvement in comparison to its predecessor GPT-3.5 by avoiding artificial
hallucination, GPT-4 still shows a lack of good judgment regarding how relevant or interesting a paper can be for a
specific research topic. Additionally, it still has trouble discerning between formal and informal sources.

4.4 Ethical and responsible use of AI

The use of AI in scientific research also brings ethical considerations to the fore, for instance ensuring that the AI
models do not perpetuate biases and are used responsibly. Researchers and developers must keep these ethical
complications in mind.

In our intensive research and careful comparison of tools, bias rarely proved to be an issue in factual scientific
research. Guarding against it nevertheless falls under the same critical evaluation of outputs, since that evaluation
includes checking them for potential bias.

5 Evaluating AI Tools: Performance Ratings


5.1 Rating criteria

The evaluation of the tools employed in this study hinges on six distinct yet interrelated criteria, each of which is
fundamental in gauging the efficacy and robustness of AI tools. These criteria are indispensable for a comprehensive
assessment, as they encapsulate the essential attributes expected of AI systems. Each criterion is rated on a scale of 0
to 5; the criteria are enumerated as follows:

Precision: Precision is vital for ensuring that the AI tool's outputs are accurate and relevant, minimizing erroneous
or irrelevant information. A score of 0 is synonymous with completely false information, which is not the case for
any of the tools, otherwise they would not have been chosen as a use case. On the other hand, a perfect score of 5
means that the tool comes up with perfect answers to any question given and provides extremely relevant resources
for a given query, which is not the case here since all of the tools struggle with fetching references.

Completeness: Completeness is essential for ensuring that the AI tool provides all the necessary information,
leaving no gaps that could hinder understanding or decision-making. A minimal score of 0 would mean that the tools
come up with outputs that are incomplete in the number of points they go over. Certain tools may at times –
especially in the text handling category – leave out some of the points mentioned in the prompt. A perfect score of 5
means that the tool goes over all of the requirements of the query and doesn’t leave questions unanswered, even if
they may provide inaccurate results at times.

Context understanding: Context understanding is critical, as it enables the AI tool to interpret the nuances and
underlying meanings in queries, ensuring more human-like responses. A score of 0 means that the tool does not take
context into consideration when dealing with a problem; for example, when enhancing a text, DeepL does not
enhance the text contextually, unlike other tools such as GPT-4.

Creativity: Creativity is important for AI tools to generate innovative solutions or ideas, which can be particularly
beneficial in solving problems and being a source of inspiration. A score of 0 means that the tool does what it is
asked to do strictly without coming up with additional ideas. A maximum score of 5 means that the tool is capable
of being a research partner by giving additional ideas and recommendations that might improve the quality of the
result.

Adaptability: Adaptability is key for AI tools to adjust to different tasks and data types, ensuring versatility and
applicability to a wide range of queries. A minimum score of 0 would mean that the tool is very limited in what it
can do and does not cover a wide range of tasks, which was not the case for any tool, since every tool can handle a
wide range of texts and tasks. More specialized tools are tested on their adaptability by handling different inputs,
such as DeepL being able to enhance a wide array of texts. A maximum score of 5 shows flexibility in the answers.
This criterion was rated more harshly for generative language models, since they are expected to perform better in
this domain due to being trained on a wide array of queries and text-manipulation tasks.

Reliability: Reliability is fundamental to ensure that the AI tool consistently performs well and can be depended
upon for critical tasks and decision-making. A minimum score of 0 shows that the tool does not meet expectations.
Reliability has been rated differently for more specialized tools, since we can only expect a text summarization tool
like Quillbot to be able to summarize texts, whereas generative language models like GPT-4 and Bing are expected
to perform well across various tasks; this is where all of them fail to reach a perfect score of 5, since all of them
performed poorly in the "References" category. That score has been reached by Github Copilot, since it is the only
tool that does exactly what it is supposed to do (being a code autocomplete tool) while being convenient to use.

5.2 Overview of the Ratings

Using the aforementioned criteria, a rating has been produced for each tool and is presented for clarity in the form of radar charts (Fig. A to Fig. I).
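For illustration, the following is a minimal matplotlib sketch of how such a radar chart can be produced; the scores used here are placeholder values and do not correspond to the actual ratings shown in the figures below.

import numpy as np
import matplotlib.pyplot as plt

criteria = ["Precision", "Completeness", "Context understanding",
            "Creativity", "Adaptability", "Reliability"]
scores = [4, 4, 5, 4, 4, 3]  # placeholder values, not the rating of any specific tool

# Spread the six axes evenly around the circle and close the polygon.
angles = np.linspace(0, 2 * np.pi, len(criteria), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(criteria)
ax.set_ylim(0, 5)
ax.set_title("Example radar chart (placeholder scores)")
plt.savefig("radar_example.png")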

Fig. A: Rating of GPT-3.5
Fig. B: Rating of GPT-4
Fig. C: Rating of Bing
Fig. D: Rating of Bard
Fig. E: Rating of YouChat
Fig. F: Rating of DeepL
Fig. G: Rating of Github Copilot
Fig. H: Rating of Quillbot
Fig. I: Rating of Graphmaker.ai

6 Conclusion
This paper embarked on an exploratory journey to assess the capabilities of AI tools in scientific research, with a
particular focus on generative language models. Through a systematic evaluation based on precision, completeness,
context understanding, creativity, adaptability, and reliability, we have uncovered the strengths and limitations of AI
tools such as ChatGPT, Google Bard, DeepL, and Github Copilot.

Our findings reveal that AI tools, especially GPT-4, have made significant progress in automating and enhancing
various aspects of the research process, including text manipulation and data visualization. This has implications for
efficiency, as researchers can allocate more time to critical thinking and analysis rather than repetitive tasks.
Moreover, the creativity exhibited by some tools, particularly GPT-4, opens up new avenues for innovation in
research.

However, it is imperative to approach the outputs of these tools with caution. The evaluation highlighted instances
of inaccuracies and artificial hallucinations, particularly in the retrieval of references. This underscores the
importance of human oversight and critical evaluation to ensure the accuracy and relevance of AI-generated content.

Furthermore, the ethical use of AI in research is paramount. Ensuring that AI models do not perpetuate biases and
are used responsibly should be at the forefront of considerations for researchers and developers alike.

Addressing the research gap, this study contributes to the literature by providing a systematic comparison of AI
tools, and it complements existing research by providing empirical data on the performance of AI tools in scientific
research.

Questions could be raised about how objective the methodology is (regarding the criteria, the tools and tasks used,
and the rating system). The method was an attempt at being as objective as possible, since the chosen criteria
represent characteristics we expect from AI, and from generative language models in particular. It is also possible
that reproducing these prompts may deliver different answers, since the tools are often updated.

Looking ahead, the rapidly evolving nature of AI indicates that the tools evaluated will continue to evolve. The
noticeable enhancement in quality and precision between GPT-3.5 and GPT-4 is a testament to this. Continuous
evaluation and adaptation are necessary to keep pace with advancements in AI.

In conclusion, AI tools hold great promise in revolutionizing scientific research. However, their integration should
be approached judiciously, with an emphasis on critical evaluation, ethical considerations, and adaptability to the
ever-evolving landscape of artificial intelligence. This study sets the stage for future research to build upon these
findings and develop more advanced and integrated tools for scientific research.

7 References
7.1 General references

1. Xiao, D. (2023, June 30). AI-Based Literature Review Tools. Texas A&M University Libraries. Retrieved July 3,
2023, from https://tamu.libguides.com/c.php?g=1289555.

2. Microsoft. (n.d.). Document AI (Intelligent Document Processing). Retrieved July 3, 2023, from
https://www.microsoft.com/en-us/research/project/document-ai/

3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017).
Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017). Retrieved July
25, 2023, from
https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

4. McCarthy, J., Minsky, M., Rochester, N., & Shannon, C. (2006). A Proposal for the Dartmouth Summer Research
Project on Artificial Intelligence. Stanford University. Retrieved July 25, 2023, from https://www-
formal.stanford.edu/jmc/whatisai.pdf.

5. IBM. (n.d.). Natural Language Processing. Retrieved July 25, 2023, from https://www.ibm.com/topics/natural-
language-processing.

6. Google. (2023, January 31). Ask a Techspert: What is generative AI?. Google Blog. Retrieved July 25, 2023,
from https://blog.google/inside-google/googlers/ask-a-techspert/what-is-generative-ai/.

7.2 AI tools used in the paper

7. OpenAI. (2023). ChatGPT (GPT-3.5) (May 24 version) [Large language model]. Retrieved from
https://www.openai.com

8. OpenAI. (2023). ChatGPT (GPT-4) (May 24 version) [Large language model]. Retrieved from
https://www.openai.com

9. Google. (2023). Google Bard (May 15 version) [Large language model]. Retrieved from https://bard.google.com

10. Bing. (2023). BingAI (version concurrent to July 3) [Large language model]. Retrieved from
https://www.bing.com

11. You. (2023). YouChat (July 3 version) [Large language model]. Retrieved from https://you.com

12. DeepL. (2023). DeepL Write (version concurrent to July 3). Retrieved from https://www.deepl.com/write

13. Graphmaker. (2023). Graphmaker.ai (version concurrent to July 3). Retrieved from https://www.graphmaker.ai

14. Github. (2023). Github Copilot (version concurrent to July 3). Retrieved from https://github.com

15. Quillbot. (2023). Quillbot (version concurrent to July 3). Retrieved from https://quillbot.com/summarize

7.3 Additional references present in the appendix

16. Checco, A., Bracciale, L., Loreti, P. et al. AI-assisted peer review. Humanit Soc Sci Commun 8, 25 (2021).
https://doi.org/10.1057/s41599-020-00703-8

8 Appendix
8.1 Text improvement

Refining the prompt is synonymous with better results. These tools are large language models, after all; a more
precise prompt will lead to better answers.

Figure 1: GPT-4 on text improvement

Figure 2: Google Bard on text improvement before VS after improving the query

Figure 3: Bing on text improvement

Figure 4: You on text improvement

8.2 Text summary

Summarize the contents of the overview of a paper on DocumentAI [2].

Figure 5: GPT-4 on text summarization

Figure 6: Google Bard on text summarization

Figure 7: Bing on text summarization

Figure 8: Quillbot on text summarization : Key sentences VS Paragraph mode

Figure 9: You on text summarization

8.3 Help in structuring the research paper

After going over the contents of the paper, the methodology used, and the results in detail, I asked the tools to come
up with a good structure that I could use as a basis for writing the paper. The points I needed to cover were presented
in the form of bullet points.

Figure 10: GPT-4 in writing inspiration

Figure 11: Google Bard in writing inspiration

Figure 12: Bing in writing inspiration

Figure 13: You in writing inspiration

8.4 Fetching references

The tools were asked to look for papers relevant to my seminar topic.

Figure 14: Google Bard on scientific references

Figure 15: GPT-4 without internet access when asked to fetch references

Figure 16: GPT-3.5 when asked to fetch references on the seminar topic (All the references are fabricated. Consequently, I have
not added the links GPT-3.5 provided.)

Figure 17: Bing when asked to fetch references [16] (item 2 is redundant since it is based on item 1)

Figure 18: You when asked to fetch references (All the references are fabricated. Consequently, I have not added the links
YOU.com provided)

8.5 Data Visualization

When presented with a simple dataset (see Table 1), the tools were asked to generate graphs to present the data in
different ways.

Figure set 1: Graphmaker.ai on data visualization

Figure set 2: GPT-4 on data visualization

Figure set 3: You on data visualization

8.6 Data Analysis

When presented with a simple dataset (see Table 1), the tools were asked to analyze this data by answering different
questions and performing different tasks.

BingAI on the data analysis questions: Sure! Here are the answers to your questions:

1. *Revenue Trends*: From the line chart, we can see that the total revenue fluctuates over time, with some days
having higher revenue than others. There does not appear to be a clear upward or downward trend in the data.

2. *Category Comparison*: The total revenue generated by the Electronics category is $8,350, while the total
revenue generated by the Books category is $600. The Electronics category has higher revenue than the Books
category by 1291.67% (8350/600*100).

3. *Product Performance*: The product with the highest total revenue is the Smartphone, with a total revenue of
$5,400. The product with the lowest total revenue is Non-fiction books, with a total revenue of $220. Here is a
Python script that generates a bar chart to visualize the total revenue of each product:

4. *Correlation Analysis*: From the scatter plot, we can see that there is a positive correlation between units sold
and revenue, meaning that as the number of units sold increases, so does the revenue.

5. *Time Series Decomposition*: From the decomposition plot, we can see that there is no clear trend or seasonality
in the data. The residuals show some fluctuations over time.

6. *Revenue Forecast*: The forecasted total revenue for the next 7 days is $1,040 per day.

7. *Category Contribution*: From the pie chart, we can see that Electronics contribute 93.3% of total revenue while
Books contribute 6.7%.

8. *Product Sales Distribution*: From the box plot, we can see that the median number of units sold is highest for
Headphones and lowest for Non-fiction books. There are no outliers in the data.

9. *Day-wise Product Sales*: From the heatmap, we can see that Smartphone sales are highest on 2023-01-01 and
2023-01-03, while Laptop sales are highest on 2023-01-02 and 2023-01-05. Headphone sales are highest on 2023-
01-04.

10. *Revenue Efficiency*: The revenue efficiency of each product is calculated as follows:

Smartphone: $5,400 / 18 = $300 per unit

Laptop: $3,500 / 5 = $700 per unit

Headphones: $450 / 15 = $30 per unit

Fiction: $380 / 19 = $20 per unit

Non-fiction: $220 / 11 = $20 per unit

Ranking the products based on their revenue efficiency, we get the following order: Laptop, Smartphone,
Headphones, Fiction, and Non-fiction.

GPT-4 on the data analysis questions: Certainly! Let's go through each prompt and analyze the data:

1. *Revenue Trends*: - The revenue trends can be visualized using a line chart. From the data, it seems that the
revenue is fluctuating over the days. There isn't enough data to identify a clear trend, but visualization would help in
identifying any patterns if the dataset was larger.

2. *Category Comparison*:

- Total revenue for Electronics: 3000 + 2100 + 2400 + 450 + 1400 = 9350

- Total revenue for Books: 100 + 80 + 120 + 140 + 160 = 600

- Electronics has higher revenue. The percentage by which Electronics revenue is higher than Books revenue is
(9350 - 600) / 600 * 100 ≈ 1458.33%.

3. *Product Performance*:

- The product with the highest total revenue is Smartphone with 5400 (3000 + 2400).

- The product with the lowest total revenue is Non-fiction books with 220 (80 + 140).

4. *Correlation Analysis*:

- A scatter plot would show the relationship between units sold and revenue. From the data, it seems there might
be a positive correlation - as units sold increase, so does revenue.

5. *Time Series Decomposition*:

- With only 5 days of data, time series decomposition isn't very meaningful. Typically, you would need a longer
time series to analyze trends, seasonality, and residuals.

6. *Revenue Forecast*:

- With such a small dataset, it's not practical to create a reliable time series model for forecasting. In practice, you
would use methods like ARIMA or exponential smoothing, but these require a larger dataset.

7. *Category Contribution*:

- Electronics contribute about 93.9% of the total revenue, while Books contribute about 6.1%.

8. *Product Sales Distribution*:

- A box plot would show the distribution of units sold for each product. From the data, it seems that Headphones
might have the widest distribution with 15 units sold in one day.

9. *Day-wise Product Sales*:

- A heatmap would show the number of units sold for each product over time. From the data, it seems that
Electronics products tend to have higher sales compared to Books.

10. *Revenue Efficiency*:

- Smartphone: 5400 / 18 ≈ 300

- Laptop: 3500 / 5 ≈ 700

- Headphones: 450 / 15 ≈ 30

- Fiction: 380 / 19 ≈ 20

- Non-fiction: 220 / 11 ≈ 20

- The product with the highest revenue efficiency is Laptop.

Please note that for some of the prompts, visualizations are mentioned. Since I can provide text-based responses, I
have described what those visualizations would likely show. Also, some analysis tasks such as time series
decomposition and forecasting require a larger dataset to be meaningful.

You on the data analysis questions:

Google Bard on the data analysis questions:

