
ECS 260 Project Progress Report: Assessing the Effectiveness of Large Language Models as Polling Participants in Qualitative Research

1 Introduction

In the flourishing field of artificial intelligence (AI), the focal point of recent advancements, applications, and innovation has been Large Language Models (LLMs). LLMs are complex models trained on extensive text datasets, which enables them to generate, comprehend, and interact using human-like language. LLM outputs often closely resemble human responses, but with such close resemblance comes a need to evaluate these responses effectively. Studies in prompt engineering ([1]) have highlighted the complexity of evaluation in this field. They emphasize that prompt evaluation is opportunistic, requires human input, and needs to be studied against multiple criteria to derive meaningful results. As also indicated by the experiments of Novikova et al. (cite), there is a need for a new set of evaluation methodologies for effectively evaluating LLM responses, as the prior state-of-the-art metrics might not fit the task perfectly. The authors (cite) showed that automatic metrics might correlate well with human judgments at the system level but correlate poorly at the instance level. Related work (cite) also shows that different instances and different systems correlate better with different sets of metrics. Thus, choosing a good set of evaluation metrics is dataset-, system-, and instance-specific. These conclusions drawn from prior state-of-the-art work have motivated us to provide a system for developers and LLM users to experiment with datasets and software-related surveys, analyze which set of metrics best evaluates the prompts, and make the necessary prompt changes to steer particular aspects of the responses toward the user's needs. This project aims to enhance the evaluation of Large Language Models (LLMs) by providing a tool for experts to analyze responses and refine prompts. Users can measure aspects like lexical diversity, readability, formality, n-gram overlap, and more. We are currently focusing on evaluating LLM responses for technical question-answering tasks. This approach can be expanded to include more metrics, datasets, and NLG tasks in the future.

1.1 Goals
• Propose a methodology to effectively evaluate LLM responses for the task of question-answering by involving a
human to assess and change results based on intent.
• Enable the user to adjust prompts to drive the LLM towards desirable results.
• Explore the applications and effectiveness of LLMs for empirical software engineering-based questions and compare
them with developer answers to judge human correlation.
• Choose an effective set of evaluation metrics that is dataset-, system-, and instance-specific.

1.2 Research Question


1. How does the integration of human feedback during the iterative training process affect the accuracy and
reliability of an LLM’s responses in the context of natural language understanding tasks? Following the
conclusions of [2], evaluation without any human influence can have ethical implications and can ignore important
human insights/intent. Thus experimenting with a human in the loop can lead to better and more human-like
evaluation.
2. In what ways do responses from an LLM to software engineering problem-solving surveys differ when using
tailored prompt engineering techniques, such as customized in-context examples versus zero-shot inferencing,
in terms of precision and adherence to software development terminology? A literature review of related work in this domain revealed a scarcity of evaluation strategies for software engineering-based questions. Thus, our data collection strategy focuses on gathering real developer responses.
3. What set of metrics/methodologies can effectively quantify the comparison of LLMs’ responses to human
responses? As stated in the introduction above, automatic metrics can show poor correlation with human judgment.
Therefore we plan on experimenting with other stylistic metrics as well to assess other aspects of the responses to
provide a holistic parallel comparison.
2 Literature review
We reviewed several papers while conducting our research. This section will help summarize the important literature relevant
to our research.
Gerosa et al. [2] highlight the application of LLMs, such as ChatGPT, for generating synthetic qualitative data in sociotechnical domains, particularly in software engineering research.
Kamalloo et al. [3] address the problem of evaluating open-domain question-answering (QA) models in the era of large language models (LLMs) that can generate long and plausible answers. The prior state-of-the-art solution was to use lexical matching metrics such as exact match (EM) and F1 score to compare candidate answers with a list of gold answers provided by a benchmark dataset. However, these metrics are too rigid and do not account for semantic equivalence, answer variations, and data quality issues. The authors systematically examine different open-domain QA models, including LLMs, through a manual assessment of their responses on a subset of NQ-OPEN, a widely used benchmark. They use BEM for supervision, a BERT-based model trained on a human-annotated collection of question-answer pairs. For zero-shot evaluation via prompting, they prompt InstructGPT to evaluate the correctness of a candidate answer. Two human annotators were asked to judge whether the answer to each question was correct. The authors concluded that the zero-shot and few-shot InstructGPT evaluators showed the largest improvement in accuracy.
The paper by Ouyang et al. [4] addresses the issue of matching LLM results to human expectations head-on by presenting a system that integrates training with human feedback. This method increases the dependability of LLMs as survey respondents by ensuring that they can comprehend and react to survey questions more accurately when they are refined based on human evaluation of their replies. While alignment has progressed, this study does not adequately investigate how to gather and optimize high-quality data for LLM training or provide thorough assessment benchmarks.
In a study by Hämäläinen et al. [5], the authors use OpenAI’s GPT-3 model to create open-ended questionnaire responses related to
experiencing video games as art—a topic that is not easily addressed using traditional computational user models. The study
examines content similarities between synthetic and actual data, evaluates faults in synthetic data, and determines if synthetic
replies can be discriminated from real ones. The authors come to the conclusion that GPT-3 can generate credible descriptions
of HCI experiences in this particular setting. Synthetic data might be useful for ideation and experiment piloting because of
the low cost and quick synthesis of LLM data. Findings derived from synthetic data, however, should always be confirmed
using actual data. The study also highlights questions regarding the validity of crowdsourced self-report data in the event that
unscrupulous individuals use LLMs to generate data.
With an emphasis on security advice studies, Anna-Marie Ortloff et al.’s [6] work "Different Researchers, Different Results?"
examines the effects of researcher experience and data type (interviews vs. surveys) on qualitative research findings. It
highlights the inherent subjectivity in qualitative research by showing how the type of data and the level of researcher
experience both have a major impact on the analysis. According to the study, seasoned researchers may use their wider
knowledge to analyze data in a different way than less experienced ones, and interviews’ interactive format can produce more
insightful answers than surveys’ organized ones. The study emphasizes the unpredictability in qualitative analysis and asks
for strict methodological procedures to lessen the impact of these biases. The authors make the case for a more reflective
and open-minded qualitative research approach by acknowledging and directly addressing the possible impacts of researcher
experience and data format. By improving the validity and reliability of results, particularly in important domains like security
guidance, this strategy seeks to guarantee that research findings are reliable and strong.

2.1 How do these papers help our study?

Integration of Human Feedback During Iterative Training Process: This concern is clearly addressed in the study by
Ouyang et al. [4], which discusses how to include human feedback into the LLM training process. Their method of improving
LLM outputs through human evaluations emphasizes how important it is to include human judgment in the iterative training
cycles in order to improve the accuracy and dependability of the model. This technique makes sure that the LLMs are more in
line with human expectations, which improves their dependability when it comes to answering surveys and comprehending and
responding to natural language inquiries. This is in line with the research topic about how LLM responses in natural language
understanding tasks are impacted by human input during training, indicating an improvement in performance through closer
alignment with human evaluating norms.
Responses to Software Engineering Problem-Solving Surveys: Gerosa et al. [2] draw attention to the use of LLMs such as ChatGPT in producing artificially generated qualitative data for software engineering studies. This work shows how LLMs can generate synthetic data that provides useful insights in sociotechnical domains, even though it does not directly compare tailored prompt engineering techniques with zero-shot inferencing. The application of LLMs in this context implies that responses may be more precise and correspond to particular terminology, such as that used in software development, provided prompts or in-context examples are carefully designed. The research suggests the significance of question framing (prompt engineering) in eliciting valuable responses from LLMs for difficult problem-solving in software engineering, even though it does not explicitly address comparing various prompting strategies.
Metrics/Methodologies for Comparing LLMs’ Responses to Human Responses: The study by Kamalloo et al. [3] offers a
thorough examination of open-domain QA models, such as LLMs, and goes beyond the use of conventional lexical matching
metrics like F1 score and exact match (EM). Their strategy addresses the need for metrics that can capture semantic equivalency
and answer variations, in line with the research question on efficient methodologies for comparing LLM responses with human
responses. It involves manual assessment and the use of a BERT-based model (BEM) for supervision. The results of this
study highlight the shortcomings of conventional metrics and the importance of including contextual relevance and semantic
comprehension in assessment frameworks. Furthermore, the study by Hämäläinen et al. [5] highlights the need for validation
and the potential differences between LLM-generated and real responses, and highlights the difficulty of comparing synthetic
LLM responses with real human data. It also looks at the generation of synthetic data for HCI experiences.

3 Data and Methods

3.1 Questionnaire Analysis

Questionnaire Dataset: The dataset is designed for enhancing Large Language Models (LLMs) by incorporating demographic data into software engineering scenarios. It includes structured questionnaire responses on demographic details such as
age, gender, ethnicity, education, and professional experience, all numerically encoded for analysis. Additionally, it captures
software engineers’ preferences regarding development environments, learning methods, and language selection criteria.
Challenges such as work-life balance and keeping up with technological advancements are also addressed. The intent is to
use this data for ’prompt engineering,’ enabling LLMs to simulate responses as if they were software engineers with specific
demographic profiles, thus creating personalized and contextually relevant interactions.

Profile Generation Methodology: Our study began with the generation of distinct user profiles incorporating variables such as age, gender, ethnicity, educational background, and work experience. For example, a profile could be an individual aged 18-22 years, identifying as male, of Asian ethnicity, holding a Bachelor's degree, and possessing less than a year of coding experience. A comprehensive list of profiles was created methodically using the "product" function from Python's itertools library, facilitating extensive coverage across a diverse demographic spectrum.
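For illustration, the following is a minimal sketch of how such profiles can be enumerated with itertools.product; the specific category values shown here are hypothetical stand-ins rather than the exact questionnaire options.

```python
from itertools import product

# Hypothetical demographic categories standing in for the actual questionnaire options.
age_groups = ["18-22", "23-26", "27-35", "36-46", "47+"]
genders = ["male", "female", "non-binary"]
ethnicities = ["Asian", "White", "Hispanic", "Black", "Pacific Islander", "Indian"]
education = ["Bachelor's degree", "Master's degree", "Professional degree"]
experience = ["less than a year", "1-2 years", "3-5 years", "6-10 years", "10+ years"]

# itertools.product yields the Cartesian product, covering every demographic combination.
profiles = [
    {
        "age": age,
        "gender": gender,
        "ethnicity": ethnicity,
        "education": edu,
        "coding_experience": exp,
    }
    for age, gender, ethnicity, edu, exp in product(
        age_groups, genders, ethnicities, education, experience
    )
]

print(len(profiles))  # total number of generated profiles
print(profiles[0])    # the 18-22, male, Asian, Bachelor's, less-than-a-year profile
```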

Figure 1: Relation between age, gender, and the biggest challenge faced as a software engineer

The data collected was unbiased with respect to work experience. The major outcome of the analysis was that early-career engineers (0-5 years of work experience) find it difficult to allocate enough time for innovation while meeting deadlines, whereas senior developers focus on both deadlines and innovation. Interestingly, the most seasoned professionals, with over 10 years of experience, also allocate specific time for research and innovation, suggesting a consistent approach to integrating innovative processes into their work schedules as they gain more experience.

Figure 2: Relation between ethnicity and the challenge of achieving innovation versus meeting deadlines

Figure 3: Relation between work experience and challenges faced related to innovation versus deadlines

Figure 4: Preferred approaches to unfamiliar programming language tasks by educational level

Figure 5: Distribution of preferred development environments among different age groups.

Generation of Responses Using LLM: Utilizing the generated profiles, we crafted varied prompts, such as: "Imagine you are aged 18-22 years, male, Asian, with a Bachelor's degree and less than a year of coding experience. Please respond to the following questions-". This was followed by the sequence of questions and their options. The prompts used in generating the responses are mentioned in the appendix below (cite).
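As an illustration, a prompt of this form can be assembled programmatically from a profile and the question list; the helper below is a hypothetical sketch with abbreviated question wording, not the exact code used in our pipeline.

```python
# Hypothetical sketch of prompt assembly; question wording is abbreviated.
questions = [
    {
        "question": "What is your preferred development environment?",
        "options": {"option_1": "Windows", "option_2": "macOS", "option_3": "Linux"},
    },
    # ... remaining questionnaire items follow the same structure
]

example_profile = {
    "age": "18-22",
    "gender": "male",
    "ethnicity": "Asian",
    "education": "Bachelor's degree",
    "coding_experience": "less than a year",
}

def build_prompt(profile: dict, items: list) -> str:
    """Combine a demographic persona with the questionnaire into one prompt string."""
    persona = (
        f"Imagine you are aged {profile['age']} years, {profile['gender']}, "
        f"{profile['ethnicity']}, with a {profile['education']} and "
        f"{profile['coding_experience']} of coding experience. "
        "Please respond to the following questions-\n"
    )
    lines = []
    for i, item in enumerate(items, start=1):
        opts = ", ".join(f"{key}: {val}" for key, val in item["options"].items())
        lines.append(f"{i}. {item['question']} ({opts})")
    return persona + "\n".join(lines)

prompt = build_prompt(example_profile, questions)
print(prompt)
```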

Figure 6: Data demographics of LLM generated responses

Observations and Improvements in Response Generation: During the zero-shot inferencing process, the model exhibited
tendencies of hallucination, resulting in verbose and occasionally off-topic responses. This issue was notably mitigated by
introducing an explicit example into the prompt, guiding the LLM to adhere more closely to the expected response format. For
instance, we refined the prompt to include specific instructions on response format, such as: "Imagine you are a 23-26-year-old,
female, Asian, with a Master’s degree and 3-5 years of coding experience. Respond to this question by selecting from the
given options only. For example, if asked about your preferred development environment, your response should be formatted
as follows: option_3: Linux." This adjustment led to a marked improvement in the relevance and precision of the LLM’s
responses, addressing the second research question.
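The adjustment can be as simple as splicing the format instruction into the persona prompt; the snippet below is a rough sketch of that idea, not the exact prompt text we settled on.

```python
# Rough sketch: augmenting a zero-shot persona prompt with an explicit format example
# to discourage verbose, off-topic answers.
zero_shot_prompt = (
    "Imagine you are a 23-26-year-old, female, Asian, with a Master's degree and "
    "3-5 years of coding experience. Please respond to the following questions-\n"
    "1. What is your preferred development environment? "
    "(option_1: Windows, option_2: macOS, option_3: Linux)"
)

guided_prompt = zero_shot_prompt.replace(
    "Please respond to the following questions-",
    "Respond to each question by selecting from the given options only. "
    "For example, if asked about your preferred development environment, your "
    "response should be formatted as follows: option_3: Linux. "
    "Now respond to the following questions-",
)
print(guided_prompt)
```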

3.2 Qasper Dataset Analysis

Data explanation: Qasper serves as a question-answering dataset specifically designed for scientific research papers. Each question in the dataset was formulated by an NLP practitioner who reviewed only the paper's title and abstract. These questions are crafted to extract information found within the contents of the paper. Subsequently, a distinct group of NLP practitioners answers the questions, providing gold references for our evaluation.

Data Preprocessing: The contents of the paper were passed to the LLM as context for answering the questions. Because the full contents of a research paper are too verbose and exceed the token limit, the "evidence" feature, which contains all paragraphs from the full text relevant to the question, was used as context. The context was converted to lower case, punctuation was removed, and the text was lemmatized to provide clean context to the LLM. The answers given by human expert annotators were used as gold references for calculating evaluation metrics.
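A minimal sketch of this preprocessing step, assuming NLTK for tokenization and lemmatization (the exact pipeline may differ slightly):

```python
import string

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of tokenizer and lemmatizer resources.
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def clean_context(text: str) -> str:
    """Lower-case, strip punctuation, and lemmatize the evidence paragraphs."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = nltk.word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(token) for token in tokens)

evidence = "We collected 220 human-human dialogs for the ANTISCAM dataset."
print(clean_context(evidence))
```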
To answer the third research question, we have used the following evaluation metrics so far:
n-gram overlap-based metrics: BLEU variants, which measure the precision of word n-grams between generated and reference texts; ROUGE variants, which measure the recall of word n-grams and longest common subsequences; and METEOR, which is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision, and which also incorporates additional semantic matching based on stems and paraphrasing.
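The following sketch shows how these scores can be computed with the nltk and rouge-score packages; it is illustrative and assumes a single reference answer per question rather than reproducing our full evaluation script.

```python
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet for stem/synonym matching

reference = "220 human-human dialogs"
candidate = "The ANTISCAM dataset consists of 220 human-human dialogs"
ref_tokens, cand_tokens = reference.split(), candidate.split()

# BLEU: n-gram precision of the candidate against the reference (smoothed for short texts).
bleu = sentence_bleu(
    [ref_tokens], cand_tokens, smoothing_function=SmoothingFunction().method1
)

# ROUGE-1 / ROUGE-L: unigram overlap and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# METEOR: harmonic mean of unigram precision and recall, recall-weighted.
meteor = meteor_score([ref_tokens], cand_tokens)

print(bleu, rouge["rougeL"].fmeasure, meteor)
```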
Embedding-based distance metrics: BERTScore, which uses BERT embeddings to compute cosine similarity between the reference and candidate text embeddings.
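A brief sketch using the bert-score package, assumed here as the implementation (its defaults may differ from the configuration we used):

```python
from bert_score import score

candidates = ["The ANTISCAM dataset consists of 220 human-human dialogs."]
references = ["220 human-human dialogs."]

# Precision, recall, and F1 are computed from cosine similarities between the
# contextual embeddings of candidate and reference tokens.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(F1.mean().item())
```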
Stylistic metrics: Readability is evaluated using the Flesch-Kincaid Reading Ease index, which measures the complexity of a text based on sentence length and syllable count; the score indicates the approximate educational level a person needs to read a particular text easily. Formality is a quantitative measure designed to assess the level of formality exhibited in a generated response. Formality in language refers to the degree of adherence to formal linguistic structures, conventions, and vocabulary. The metric rewards more frequent usage of nouns, adjectives, articles, and prepositions relative to pronouns, verbs, adverbs, and interjections. It also penalizes the usage of deictic words such as "I", "here", and "now", and instead rewards specific references such as names, dates, and locations.

Formality Score = (NNF + ADJF + PREPF + ARTF − PPNF − VBF − ADVF − INTF + 100) / 2
where NNF is noun frequency, ADJF is adjective frequency, PREPF is preposition frequency, ARTF is article frequency, PPNF is pronoun frequency, VBF is verb frequency, ADVF is adverb frequency, and INTF is interjection frequency.
The formality score ranges from 0 to 100, with higher scores indicating higher formality.
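Both stylistic metrics can be approximated as sketched below, assuming textstat for the Flesch reading-ease score and NLTK part-of-speech tags (with determiners standing in for articles) to estimate the word-class frequencies in the formality formula; this is an approximation of the formula above, not our exact implementation.

```python
import nltk
import textstat

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Map Penn Treebank tag prefixes onto the word classes used in the formality formula.
CLASS_PREFIXES = {
    "noun": ("NN",), "adj": ("JJ",), "prep": ("IN",), "art": ("DT",),
    "pron": ("PRP", "WP"), "verb": ("VB",), "adv": ("RB",), "intj": ("UH",),
}

def formality_score(text: str) -> float:
    """Approximate the formality score from POS-tag frequencies (as percentages)."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    total = len(tags) or 1
    freq = {
        name: 100.0 * sum(tag.startswith(prefixes) for tag in tags) / total
        for name, prefixes in CLASS_PREFIXES.items()
    }
    return (freq["noun"] + freq["adj"] + freq["prep"] + freq["art"]
            - freq["pron"] - freq["verb"] - freq["adv"] - freq["intj"] + 100) / 2

answer = "The ANTISCAM dataset consists of 220 human-human dialogs."
print(textstat.flesch_reading_ease(answer), formality_score(answer))
```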

4 Milestones
4.1 Already done

Visualization and Observations of Questionnaire:

1. Figure 1 Observations:
• Survey demographics include 61 participants with 44 males and 17 females.
• Early-career engineers (ages 18-26) primarily struggle with time management and staying updated with technol-
ogy, whereas senior engineers face challenges with project deadlines and legacy codebases.
2. Figure 2 Observations:
• Ethnic distribution is skewed towards Asians (45) and Indians (4) out of 61 respondents.
• Across demographics, particularly among females, keeping pace with technology and managing time for
innovation are significant challenges. Whites and Pacific Islanders additionally indicate difficulties in work-life
balance and codebase understanding.
3. Figure 3 Observations:
• Work experience in the dataset shows no bias and spans across different career stages.
• Engineers with 0-5 years of experience mainly report time management challenges for innovation and meeting
deadlines. More seasoned professionals, with over 10 years of experience, consistently allocate time for research
and innovation.
4. Figure 4 Observations:
• Responses based on education level show that Bachelor’s degree holders often seek help from knowledgeable
colleagues. Master’s degree holders tend to learn new languages beforehand, while professional degree holders
prefer adapting on the fly. Reallocating the project is the least popular approach.
5. Figure 5 Observations:
• The preferred development environment among the majority is Windows, especially for those aged 23-26. macOS
is the second choice, favored by respondents aged 27-35. Linux is consistently chosen by individuals between
23-35 years old, and one respondent aged 47 is comfortable with both macOS and Linux.

LLM Used: To generate responses to our set of questions, a predefined list of dictionaries mapping each question to its associated options, we employed OpenAI's 'gpt-3.5-turbo' model under a zero-shot inferencing framework. This approach was pivotal in assessing the model's capability to generate relevant responses without prior specific training on the questionnaire context.
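A minimal sketch of such a call with the openai Python client (v1+); the prompt is one of the persona prompts from Section 3.1, and the temperature setting is illustrative rather than the exact value we used.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Imagine you are aged 18-22 years, male, Asian, with a Bachelor's degree and "
    "less than a year of coding experience. Please respond to the following questions-\n"
    "1. What is your preferred development environment? "
    "(option_1: Windows, option_2: macOS, option_3: Linux)"
)

# Zero-shot inference: no questionnaire-specific examples are given to the model.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,  # illustrative; not necessarily the setting used in our runs
)
print(response.choices[0].message.content)
```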

Qasper analysis Observations:

• When comparing zero-shot and few-shot inferencing, zero-shot responses often include extra text. For instance, when
asked about the size of the ANTISCAM dataset, the zero-shot response was “The ANTISCAM dataset consists of
220 human-human dialogs,” while the more concise few-shot response was “220 human-human dialogs.” This shows
that the few-shot strategy adheres more closely to our prompt for brief, exact answers.
• The metric analysis showed that n-gram overlap-based metrics capture the quality of responses very effectively when
the answer is exactly extracted from the context provided. When the LLM answer is something derived from the
context and not an exact match, the automatic metrics show less correlation with humans.
• Non-reference and non-n-gram overlap metrics are effective for free text evaluation, but automatic metrics remain
crucial for fact-based answers that are directly extracted from the context. Further analysis will refine the metric set.
• Stylistic-based metrics can be ignored for exact answers from the context as they are concise, but they are very critical
to recognize hallucinated outputs as the readability and formality decrease with verbose answers. These metrics are
useful for free text answers not directly found in the context.

• Human responses were gathered via Google Forms, and these were used to craft tailored prompts incorporating demographic data to simulate responses as though they were produced by a Large Language Model (LLM) in place of actual participants. To decipher the relationship between demographic factors and the responses provided by both humans and the LLM, visual analyses were conducted using bar graphs, among other methods (a minimal plotting sketch follows this list). Observations from these analyses have shed light on our central research question: the feasibility of employing LLMs as alternatives to conventional survey data collection methods.
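A minimal, hypothetical sketch of the kind of bar-graph comparison used in these analyses; the file name and column names are illustrative, not the actual export.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export combining Google Forms answers and LLM-generated answers;
# assumed columns: biggest_challenge, source ("human" or "llm").
df = pd.read_csv("survey_responses.csv")

# Count responses per challenge, split by source, and plot them side by side.
counts = df.groupby(["biggest_challenge", "source"]).size().unstack(fill_value=0)
counts.plot(kind="bar", figsize=(8, 4))
plt.ylabel("Number of responses")
plt.title("Biggest challenge reported: human vs. LLM-generated responses")
plt.tight_layout()
plt.show()
```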

4.2 What is left to achieve


• Experiment with more prompt engineering and include more metrics to come up with the best metric set for evaluation.
• Generate more diversity in the LLM answers to the questionnaire depending on coding experience.
• Conduct a thorough evaluation of alternative language models or fine-tuning techniques to mitigate hallucination
tendencies and improve response quality. Experiment with different zero-shot frameworks or pre-trained models to
assess their suitability for generating contextually accurate responses.
• Iterate on prompt structures to provide clearer guidance and minimize ambiguity for the language model. Explore
methods such as providing multiple examples or incorporating conditional logic to steer the model towards more
relevant and concise responses.
• Explore additional demographic variables or refine existing ones to enhance the diversity and specificity of generated
user profiles.
• To answer the first research question, we have to conduct thorough testing with different users to measure the effectiveness of involving human input in evaluating LLMs.

Our survey data appears to be biased in certain aspects such as ethnicity (Asian and Indian respondents form the majority), gender, and age. To tackle this problem, we are planning to circulate the survey in various other, more diverse communities, which will help us gather better-quality data. The current prompts used on the LLMs also need to be revised.

6 Timeline for the next 3 weeks


Although we do have a timeline and a deadline for the project, the milestones may change depending on the needs of this
project as we explore different sections of this research. The following list provides a step-wise timeline of the research:

• Week 1: Designing a questionnaire, conducting a further literature review, and making a video presentation.
• Week 2: Collecting and pre-processing data, and designing LLM inputs.
• Week 3:

This timeline leaves room for the inclusion or exclusion of several ideas that we expect to encounter as the project takes shape.

7 Team Membership and Attestation


The team consists of Tarun Tiwari (ttiwari@ucdavis.edu), Aryaman Bahukhandi (abahukhandi@ucdavis.edu), Kshitij Sinha (kxsinha@ucdavis.edu), Abrar Syed (abrsyed@ucdavis.edu), and Radhika Gupta (rkagupta@ucdavis.edu).
This is to attest that everyone’s contribution is equally important to the success of this project. The roles taken by everyone in
the team may switch while we move through different phases of the project.

References
[1] Ellen Jiang et al. “Promptmaker: Prompt-based prototyping with large language models”. In: CHI Conference on Human
Factors in Computing Systems Extended Abstracts. 2022, pp. 1–8.
[2] Marco Gerosa et al. “Can AI serve as a substitute for human subjects in software engineering research?” In: Automated
Software Engineering 31.1 (2024), p. 13.

[3] Ehsan Kamalloo et al. “Evaluating Open-Domain Question Answering in the Era of Large Language Models”. In: arXiv
preprint arXiv:2305.06984 (2023).
[4] Long Ouyang et al. “Training language models to follow instructions with human feedback”. In: Advances in Neural
Information Processing Systems 35 (2022), pp. 27730–27744.
[5] Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. “Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study”. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. CHI ’23. Hamburg, Germany: Association for Computing Machinery, 2023. ISBN: 9781450394215. DOI: 10.1145/3544548.3580688. URL: https://doi.org/10.1145/3544548.3580688.
[6] Anna-Marie Ortloff et al. “Different Researchers, Different Results? Analyzing the Influence of Researcher Experience
and Data Type During Qualitative Analysis of an Interview and Survey Study on Security Advice”. In: Proceedings of
the 2023 CHI Conference on Human Factors in Computing Systems. 2023, pp. 1–21.
