
Evaluating the Consistency of ChatGPT Responses Across Repeated Prompts

A Preprint

Mikkel Broch-Lips, Technical University of Denmark, Lyngby, Denmark (s234860@dtu.dk)
David Lindahl, Technical University of Denmark, Lyngby, Denmark (s234817@dtu.dk)
Carl Svejstrup, Technical University of Denmark, Lyngby, Denmark (s234840@dtu.dk)

June 24, 2024

Abstract

This study investigated the consistency of responses from OpenAI's GPT-4o, focusing on how the nature of a question affects the reliability of the response. Consistency in AI responses is crucial for applications requiring reliable and repeatable outputs, such as customer service automation. A few-shot prompting approach was used both to generate subjective and objective questions with simple top-10 answers and to generate the responses. Our results indicate that GPT-4o provides more consistent responses for objective questions (mean Ranked Jaccard similarity of 0.59) than for subjective ones (mean Ranked Jaccard similarity of 0.33). A Kruskal–Wallis test confirmed a significant difference in consistency between the two types of questions. This can prove insightful when developing AI applications that rely on consistent responses. Future research could consider testing LLMs across different temperatures, or using more general prompts, to improve the generalization of the results.

1 Introduction

Large Language Models (LLMs), such as OpenAI’s GPT-4o, are pre-trained models that generate text by predicting the
next word based on a given prompt. These models have revolutionized natural language processing with their advanced
text-generation capabilities. However, LLMs can show variance in output for the same prompt, raising concerns about
their reliability and consistency.
This project investigated the consistency of responses from OpenAI's GPT-4o, assessing whether the model provides consistent answers when asked the same question multiple times in different sessions. We divided the prompts into two classes: objective questions (one answer is better than the others) and subjective questions (room for interpretation). The two types of questions were compared using different similarity metrics across repeated prompts. The questions required top-10 answers so that the focus was on the conclusion rather than on the formulation of the model's answers. Few-shot prompting was used to generate both the questions and the responses [1].

This revealed how consistent state-of-the-art LLMs are in their answers, which can prove insightful when developing applications that rely on consistent and reliable AI-generated outputs, such as customer service chatbots or automated content creation. Understanding the consistency of these models helps optimize their use in various real-world applications, ensuring better performance and improved user experiences.

2 Literature Review

Research on the consistency of responses from LLMs is crucial for applications requiring reliable outputs. Here, we
review key findings from recent studies that examine this consistency across different conditions.
In the study "Methods to Estimate Large Language Model Confidence" (2023) by Kotelanski et al., GPT-4's responses were evaluated by repeating the same prompt 11 times, achieving a consistency percentage of 42%. This consistency rate was determined using a diverse set of challenging clinical case questions and was calculated based on the frequency of the most common answer across multiple runs. This method allowed the researchers to measure how reliably the model produced the same or most frequent output. It is noteworthy that they used a temperature of 1.0 and a stricter consistency measure; we therefore expect our results to be higher than their 42% [2].
In “Dissecting Similarities in Self-Consistency: An Analysis on Impact of Semantic Consistency on Language Model
Reasoning,” (2024) researchers compared GPT-3.5’s self-consistency at a fixed temperature and evaluated it across five
different temperature settings (0.2, 0.4, 0.6, 0.8, 1.0). They found a consistency of 46.50% with a fixed temperature and
48.54% when varying the temperatures. To achieve these results, they vectorized the output and used a cosine similarity
measure to evaluate the consistency between responses [3].
These studies provide valuable insight into the levels of consistency we can expect from our own experiment, although our experimental design differs from theirs in that we use a top-10-questions approach.

3 Methodology

3.1 Data Collection and Experimental Design

To compare the consistency of GPT-4o's output between subjective and objective questions, we followed the experimental structure depicted in Figure 1. We prompted GPT-4o with questions, which it answered by generating top-10 lists. The questions were designed with concrete and clear semantic rules to distinguish between the two question classes. To avoid errors in the similarity calculation (for example, Python treating the strings "41°" and "41.0°" as different), we implemented a few-shot prompting strategy, shown in Appendix B, to ensure an identical, clean output format from GPT-4o.
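To illustrate the issue, a naive string comparison in Python treats such formatting variants as different items, which is why the output format had to be constrained through prompting rather than cleaned up afterwards:

    # Naive string comparison treats formatting variants as different items:
    print("41°" == "41.0°")  # False, even though both describe 41 degrees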

3.1.1 Question Determination


When designing the format of the questions, we focused on ensuring the consistency of GPT-4o's conclusions. Notably, when the same question is asked multiple times, GPT-4o often provides differently worded answers, even though the content remains essentially the same. To minimize this variability, we chose short-answer questions whose responses involve proper nouns or numbers. Additionally, we utilized top-10 questions to better apply similarity theory. Our aim was to generalize the findings as much as possible; therefore, we created a wide range of questions, varying both in topic and phrasing. See Figure 4.


The objective questions were defined as questions with only one true answer, such as "Most sold books in 2020". Here there is only one measurement to consider when determining the top 10: the number of books sold.
Subjective questions, however, can vary based on personal perspectives and preferences, for example "Most beautiful tourist destinations" and "Most popular musician of all time". In the last example the ranking could, for instance, be determined by most sold albums or by streams on Spotify, and is therefore subjective to the one answering.
A set of rules was established to guide the formulation of these questions:
Rule set for questions

• Objective Questions: Each question should have only one true answer based on verifiable data with no room
for interpretation.
• Subjective Questions: Each question should be specific yet open to interpretation, allowing for multiple
answers that may vary depending on the respondent.
• All Questions: Each question should result in a ranked top 10 list.
• All Questions: Ensure the questions cover a wide range of topics to represent various fields.

We generated 132 questions with GPT-4o using few-shot prompting. The few-shot prompts are shown in Appendix A (subsections A.1 and A.2). For each question class we used the appropriate part of the rule set together with 10 highly diverse example questions as few-shot examples. The example questions naturally had different degrees of interpretability and difficulty.

Examples: "Give top 10 [question]"

Subjective Questions                            Objective Questions
Most beautiful beaches in Thailand              Shortest serving US Presidents
Most terrible pandemics before 1900             Most viewed TED Talks
Best national soccer teams in the 1980s         Oldest universities in the world

Table 1: Examples of Subjective and Objective Questions

Figure 1: A comprehensive illustration of the experimental design for evaluating LLM consistency, showcasing key
components and workflow.


3.2 Model Configuration

The area of LLMs is diverse, with numerous models available, each displaying varying levels of consistency in their
outputs. In this paper, we considered the state-of-the-art GPT-4o model from OpenAI. The model was implemented
using the OpenAI API, which allowed for significant scaling of our dataset. It also ensured independence between answers, avoiding context carryover and potential bias. This method ensured the integrity and variability required for robust statistical analysis. Using the API, we were able to adjust various model parameters, including temperature, top-p, frequency penalty, and others. However, we primarily focused on tuning the temperature parameter, as modifying the other parameters was beyond our current knowledge of model tuning.
We aimed to set the temperature as close as possible to the default setting used in the publicly available ChatGPT-4o model, to ensure our results would be comparable to those of commercially available models. While specific details about the temperature setting of ChatGPT-4o are not publicly available, it is generally understood to be around 0.7 to 0.8 based on available sources [4]. The temperature for this study was therefore set to 0.7.
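For reference, a minimal sketch of how a single, context-free response can be collected through the OpenAI Python client is shown below. The preprompt wording is given in Appendix B; the helper name and the example question are illustrative assumptions, not our exact code.

    # Minimal sketch, assuming the openai>=1.0 Python client and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PREPROMPT = "Please provide a ranked list of the top 10 [category] ..."  # full text in Appendix B

    def ask_gpt4o(question: str, temperature: float = 0.7) -> str:
        """Send one question in a fresh conversation so there is no context carryover."""
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=temperature,
            messages=[
                {"role": "system", "content": PREPROMPT},
                {"role": "user", "content": f"Q: {question}"},
            ],
        )
        return response.choices[0].message.content

    # Example call (illustrative question):
    # answer = ask_gpt4o("Give top 10 oldest universities in the world")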

3.3 Experimental Design

3.3.1 Consistency metric design


When deciding on the metrics for similarity, we chose two variations of the Jaccard index: Jr (Ranked Jaccard) to evaluate the consistency of the ranked lists generated by ChatGPT-4o, and Ju (Unranked Jaccard) to evaluate the common items in the top-10 lists.

Jr(A, B) = |Ar ∩ Br| / |Au ∪ Bu|,        Ju(A, B) = |Au ∩ Bu| / |Au ∪ Bu|

Au ∪ Bu is the union of the two top-10 lists without considering ranking information; for example, "1 Lion" = "2 Lion". Ar and Br, on the other hand, retain the ranking information, so that "1 Lion" ≠ "2 Lion".
For each question, we calculated the two similarity measures pairwise over the responses. These measures were then used to estimate the mean consistency for each question. This approach ensures that we capture all the information in the data and makes it easier to replicate the results with a different number of repeats.
Jr was designed to capture rank agreement while still giving partial credit when the same item appears in both lists at different ranks; this is achieved by not including the rank in the set of unique elements (the denominator). Jr therefore carries the most information and is the main similarity measure used in this study. Ju is the more conventional Jaccard index and is mostly used to nuance the output of Jr and to look for possible outliers in the data.
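As an illustration, the two measures can be computed as follows. This is a minimal sketch of the definitions above, assuming each response has already been parsed into an ordered Python list of item strings.

    def ranked_jaccard(a: list[str], b: list[str]) -> float:
        """Jr: items must appear at the same rank to count in the intersection,
        while the denominator ignores rank (unique items across both lists)."""
        ranked_matches = sum(1 for x, y in zip(a, b) if x == y)
        unranked_union = set(a) | set(b)
        return ranked_matches / len(unranked_union)

    def unranked_jaccard(a: list[str], b: list[str]) -> float:
        """Ju: conventional Jaccard index on the unordered sets of items."""
        a_set, b_set = set(a), set(b)
        return len(a_set & b_set) / len(a_set | b_set)

    # Example with two hypothetical top-3 lists (top-10 in the study):
    # ranked_jaccard(["Lion", "Cheetah", "Horse"], ["Cheetah", "Lion", "Horse"])   -> 1/3
    # unranked_jaccard(["Lion", "Cheetah", "Horse"], ["Cheetah", "Lion", "Horse"]) -> 1.0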

3.3.2 Sample Size Estimation


To ensure the reliability and validity of our study, we used the conventional method of calculating the sample size through a pilot study of 10 different prompts for each question class. We were mostly interested in comparing the Ranked Jaccard means, so we used this metric to calculate the sample size:

n = 2p(1 − p)(Zβ + Zα/2)² / (pA − pB)²

We chose p(1 − p) to be the larger of the two estimated standard deviations for the Jr question means, with Zβ = 0.8 and Zα/2 = 1.96 at a 5% significance level. This gave

n = 91.6

So given this estimate of n, we needed a minimum of 92 different questions for each class.
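A small sketch of this calculation is given below. The pilot-study estimates of p and of the expected mean difference (pA − pB) are not reported here, so the values plugged in are placeholders for illustration only and will not reproduce n = 91.6 exactly.

    def sample_size(p: float, diff: float, z_beta: float = 0.8, z_alpha_half: float = 1.96) -> float:
        """n = 2p(1-p)(Z_beta + Z_alpha/2)^2 / (p_A - p_B)^2, with the Z values stated above."""
        return 2 * p * (1 - p) * (z_beta + z_alpha_half) ** 2 / diff ** 2

    # Placeholder pilot estimates (assumptions, not the study's actual pilot values):
    # print(sample_size(p=0.5, diff=0.2))  # ~95 questions per class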
Given the wide scope of our investigation and our aim to draw generalizable conclusions about ChatGPT-4o, a large data set is crucial to reduce bias and improve generalizability. We also expected to find technical issues with some questions, which would then have to be removed. We therefore added roughly half of the estimated sample size on top, giving us n = 132 different questions for each class.


3.3.3 Response Collection


Each of the 132 questions was prompted to GPT-4o 11 times. This repetition allowed for a robust assessment of response consistency across multiple iterations of the same prompt. We removed three questions from the response dataset, resulting in 2 × 11 × 132 = 2904 data points (top-10 lists), which were used to test the consistency of each prompt under the two question classes.
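A sketch of the collection and aggregation step is shown below, reusing the ask_gpt4o helper sketched in Section 3.2 and the Jaccard functions sketched in subsubsection 3.3.1; the function names and the comma-splitting parser are illustrative assumptions rather than our exact pipeline.

    from itertools import combinations
    from statistics import mean

    N_REPEATS = 11

    def parse_top10(raw: str) -> list[str]:
        """Split the comma-separated response into a clean, ordered list of items."""
        return [item.strip() for item in raw.split(",") if item.strip()]

    def question_consistency(question: str) -> tuple[float, float]:
        """Collect 11 independent responses and return the mean Jr and Ju over all pairs."""
        lists = [parse_top10(ask_gpt4o(question)) for _ in range(N_REPEATS)]
        pairs = list(combinations(lists, 2))  # 11 choose 2 = 55 pairwise comparisons
        jr = mean(ranked_jaccard(a, b) for a, b in pairs)
        ju = mean(unranked_jaccard(a, b) for a, b in pairs)
        return jr, ju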

4 Results
4.1 Presentation of results

To determine the difference in consistency between the two question classes, we evaluated the following null hypothesis H0: "There is no significant difference in similarity between the question classes". The hypothesis was tested for both the Jr and Ju similarities. All tests were evaluated at a 5% significance level, using the Kruskal–Wallis test (independence between responses was established through the prompting design). We used non-parametric tests because of the distributions of the similarity data, shown in Figure 2. Figure 2 shows histograms of the similarity metrics for both question categories. While the distributions of the similarity metrics are fairly comparable for each category, none of them appear to be normally distributed. To further evaluate normality, we quantified this assumption with a Shapiro–Wilk test. The results in Table 2 confirmed that the similarity data was not normally distributed. We also examined QQ-plots, shown in Appendix C.

Figure 2: Histogram of similarities for subjective and objective questions

Shapiro–Wilk test: p-values

Metric              Objective        Subjective
Jaccard Ranked      2.392 × 10⁻⁵     7.067 × 10⁻³
Jaccard Unranked    1.383 × 10⁻⁵     7.950 × 10⁻⁴

Table 2: Shapiro–Wilk test p-values for the similarity measures.
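The normality check can be reproduced with SciPy along the following lines; the similarity values below are random placeholders standing in for the per-question mean similarities of each class, not the study's data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Placeholder per-question mean similarities (random stand-ins, NOT the study's data):
    similarity_data = {
        ("objective", "Jr"): rng.beta(4, 2, size=129),
        ("subjective", "Jr"): rng.beta(2, 4, size=129),
    }

    for (question_class, metric), values in similarity_data.items():
        statistic, p_value = stats.shapiro(values)
        print(f"{question_class} {metric}: Shapiro-Wilk p = {p_value:.3e}")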

4.2 Confidence Intervals

Since our data was not normally distributed, we calculated the confidence intervals for the mean similarities via non-parametric bootstrapping, resampling the data 10,000 times. Table 3 depicts the mean similarities along with the bootstrapped margin of error (MOE). The confidence intervals are also visualized in Figure 3 (we verified that the confidence intervals were symmetric to within three decimal places). The confidence intervals might not provide a satisfactory picture of the relations due to the unknown distributions and a large concentration around 1.0 in the similarity measures of the objective class. We therefore performed a statistical test to investigate our results further.

Confidence interval: mean and margin of error

Metric              Objective Mean    Subjective Mean
Jaccard Ranked      0.592 ± 0.045     0.334 ± 0.031
Jaccard Unranked    0.748 ± 0.034     0.597 ± 0.039

Table 3: Consistency metrics for objective and subjective questions.

Figure 3: Non-parametric bootstrapped 95% confidence intervals for the question-mean similarities
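The bootstrapped intervals can be obtained with a routine along these lines; a minimal NumPy sketch assuming an array of per-question mean similarities.

    import numpy as np

    def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for the mean of per-question similarities."""
        rng = np.random.default_rng(seed)
        values = np.asarray(values)
        boot_means = np.array([
            rng.choice(values, size=values.size, replace=True).mean()
            for _ in range(n_resamples)
        ])
        lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return values.mean(), lower, upper

    # Example with placeholder data (not the study's similarities):
    # mean, lo, hi = bootstrap_ci(np.random.default_rng(1).beta(4, 2, size=129))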

4.3 Statistical test

The results from the Kruskal–Wallis test are displayed in Table 4. The table shows the p-values under H0 for both Jr and Ju. The test rejects H0, as the p-values are below the significance level of 0.05. This indicates a significant difference between the question classes for both types of similarity measure.

                           Jaccard Ranked    Jaccard Unranked
Kruskal–Wallis p-values    3.433e-14         2.511e-07

Table 4: p-values from the two Kruskal–Wallis tests for the difference between question classes.
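The test itself is a one-liner with SciPy; the two arrays below are random placeholders standing in for the per-question mean Jr similarities of each class.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Placeholder per-question mean Jr similarities (random stand-ins, NOT the study's data):
    objective_jr = rng.beta(4, 2, size=129)
    subjective_jr = rng.beta(2, 4, size=129)

    h_statistic, p_value = stats.kruskal(objective_jr, subjective_jr)
    print(f"Kruskal-Wallis H = {h_statistic:.3f}, p = {p_value:.3e}")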

5 Discussion
5.1 Discussion of results

The estimated Ranked Jaccard mean similarities for objective and subjective questions were 0.592 and 0.334, respectively (see Table 3). These results indicate that, on average, 5.92 out of 10 items were consistently present in both lists and at the same rank for objective questions, whereas for subjective questions the consistency was lower at 3.34 out of 10.
The estimated Unranked Jaccard mean similarities for objective and subjective questions were 0.748 and 0.597, respectively (see Table 3). These results indicate that, on average, 7.48 out of 10 items were consistently present in both lists for objective questions, while for subjective questions the consistency was slightly lower at 5.97 out of 10. These results indicate that the model maintains consistency better for objective questions than for subjective ones.

Figure 4: Chart of Topics
The results of the Kruskal–Wallis test provided statistical evidence of a significant difference in the consistency of responses between objective and subjective questions. The bootstrapped 95% confidence intervals further supported these findings.
The significant p-values for both the Jr and Ju metrics in Table 4 led to the rejection of the null hypothesis (H0) of no significant difference in consistency. The bootstrapped confidence intervals for Jr and Ju, in Figure 3 and Table 3, did not overlap, which further emphasized a significant difference in response consistency. The higher similarity for the Unranked Jaccard compared to the Ranked Jaccard was expected due to the additional condition of matching the order of elements in the Ranked Jaccard metric, as described in subsubsection 3.3.1. The narrow margins of error for all similarity metrics indicate the robustness of the analysis.
These findings highlight the importance of question phrasing when interacting with AI models. The choice between
using subjective or objective language can significantly affect the model’s performance and reliability. Ensuring
questions are clear and specific can help improve the consistency and usefulness of AI-generated responses.

5.2 Limitations

Our experimental design has some limitations as we base the consistency measure on top-10 questions. Future research
could include a wider variety of questions, such as different types of ranked top 10 lists, single-answer questions, and
prompts requiring ChatGPT to generate creative responses. This approach would provide a more comprehensive view
of the overall consistency.
Also, the classification of questions into subjective and objective classes provides only a limited understanding of how consistency differs between these types. However, it is challenging to create a continuous measure that accurately spans the spectrum from objective to subjective.
With our experimental design, it is inevitable that a small amount of bias is introduced when creating the questions. However, we used a rather large data set of questions, which reduced this bias and improved the generalization. To check for this, we visualized the effect by examining the frequency of repeated topics (Figure 4). Questions revolving around TV series, movies, and sports appear to be over-represented in our data. While this might not significantly influence the outcome, it could potentially affect the generalizability of the results.
Another limitation to consider is the configuration of the GPT-4o model. One of the most crucial parameters is the temperature, which significantly affects response consistency. In Appendix D, boxplots show similarity measures from objective questions with temperature settings of 0.2 and 0.7. The results clearly indicate that lower temperatures result in higher similarity metrics. Because OpenAI keeps its temperature settings secret, we do not know the default temperature or whether GPT-4o uses a dynamic approach, raising the temperature for more interpretable tasks and lowering it for more objective prompts. Until OpenAI publishes its temperature settings, our method cannot draw final conclusions about the consistency of the publicly available ChatGPT-4o. It is important to emphasize that this paper is not intended to achieve the best possible consistency (which would call for a temperature of 0), but rather to evaluate the two question types at the chosen temperature setting. That said, it could be interesting for future studies to explore the impact of varying temperatures to better understand their influence on response consistency.

6 Conclusion
Our experiment found that ChatGPT-4o is significantly more consistent at answering objective questions than subjective ones. With mean Ranked Jaccard similarities of 0.334 for subjective questions and 0.592 for objective questions, we can infer that GPT-4o provides more consistent answers for objective questions than for subjective ones. These results highlight the importance of fact-checking information from ChatGPT. However, we must acknowledge that not knowing the exact temperature settings, and using only top-10 questions, limits our findings.
Future researchers should consider using a cross-temperature validation approach, testing consistency at different
temperature settings, which would produce more generalizable results.

References
[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[2] Maia Kotelanski, Robert Gallo, Ashwin Nayak, and Thomas Savage. Methods to estimate large language model
confidence. arXiv preprint arXiv:2312.03733, 2024.
[3] Anonymous ACL submission. Dissecting similarities in self-consistency: An analysis on impact of semantic
consistency on language model reasoning. OpenReview.net, 2024.
[4] GPT Workspace. Temperature in the AI world: A short guide on how to use the OpenAI temperature parameter for better ChatGPT responses (both in GPT-3 and GPT-4), 2023. Accessed: 2024-06-23.
Link to code: https://github.com/DavidLindahl/02445_report


7 Appendix

A Prompts for generating questions


A.1 Subjective prompt

I need to generate subjective questions for a study that tests the consistency of chatGPT-4o.

The rule set:


- Each question should be specific yet open to interpretation,
allowing for multiple answers that may vary depending on the respondent.
- Each question should result in a ranked top 10 list.

Format:
“Give a list of the top 10 [question]”

Examples of [question]:
- most popular dogs in the USA
- most innovative cars for its time
- strongest insects in the world
- best national soccer teams in the 1980s
- most popular desserts in 2024
- most beautiful beaches in Thailand
- most influential philosophers of all time.
- worst architectural designs of famous buildings
- most terrible marketing campaigns by major brands
- worst acting performances by A-list celebrities

Give me 120 more [question] following the rule set and inspired by the examples.
Do not include "Give a list of the top 10".
Ensure the questions cover a wide range of topics to represent various fields.

A.2 Objective prompt

I need to generate objective questions for a study that tests the consistency of chatGPT-4o.

The rule set:


- Each question should be specific yet open to interpretation,
allowing for multiple answers that may vary depending on the respondent.
- Each question should result in a ranked top 10 list.

Format:
“Give a list of the top 10 [question]”

Examples of [question]:
- Most voted politicians in the Danish election 2019
- Most sold books in 2020
- Universities in the US, with the highest average exam grades in 2021
- Most sold phone models in 2020
- Top-ranking universities in the world according to QS World University Rankings 2021
- Most cited scientific paper
- Largest countries by inhabitants in 2022
- Fastest lap times on Silverstone of all time
- Highest temperatures measured in Germany of all time
- Highest points in Argentina

Give me 120 more [question] following the rule set and inspired by the examples.
Do not include "Give a list of the top 10".

Ensure the questions cover a wide range of topics to represent various fields.

B Preprompt Instructions
Few-shot prompt for generating responses

Please provide a ranked list of the top 10 [category] in the following format,
when a category is given.
It should NOT include the ordered number in front of the answer.
Ensure consistency in output by using a single format for each item,
avoiding any additional descriptors or alterations.
Each response should be a comma-separated list of names or items,
with no extra text or commentary.
We are testing the output for consistency.
Therefore, every ranking item needs to ONLY INCLUDE THE TITLE.

[Item 1],
[Item 2],
[Item 3],
[Item 4],
[Item 5],
[Item 6],
[Item 7],
[Item 8],
[Item 9],
[Item 10]

Do not provide any additional text or context for the question.


Ensure that identical queries return consistent results.

Please correctly answer the following questions.


{few-shot demos}

Q: wealthiest people in the world in 2022.


A: Elon Musk, Bernard Arnault, Jeff Bezos, Mark Zuckerberg, Larry Ellison, Larry Page,
Sergey Brin, Warren Buffett, Bill Gates, Steve Ballmer

Q: fastes speed of land animals.


A: Cheetah, Pronghorn Antelope, Springbok, Wildebeest, Lion, Blackbuck,
American Quarter Horse, Brown Hare, Greyhound, Kangaroo

Q: Most point-scoring teams in the Bundesliga for the 2020-2021 season


A: Bayern Munich, RB Leipzig, Borussia Dortmund, Wolfsburg, Eintracht Frankfurt,
Bayer Leverkusen, Union Berlin, Borussia Mönchengladbach, VfB Stuttgart, SC Freiburg

Q: [Question]

C QQ-Plots for Normality Test


Figure 5: QQ-plots for the normality test, depicted for the Jaccard Ranked and Unranked similarities for both objective and subjective questions.

D Boxplot for Similarity Measures at Different Temperatures


Figure 6: Boxplots of similarity measures for objective questions answered at two different temperatures, 0.2 and 0.7. The plots show a clear difference in similarity.
