ECS260 Project Progress Report
1 Introduction
In the flourishing field of artificial intelligence (AI), recent advancements, applications, and innovation have centered on Large Language Models (LLMs). LLMs are complex models trained on extensive text datasets, enabling them to generate, comprehend, and interact using human-like language. Because LLM outputs so closely resemble human responses, effectively evaluating those responses has become essential. Studies in prompt engineering ([1]) have highlighted the complexity of evaluation in this field. They emphasize that prompt evaluation is opportunistic, requires human input, and must be studied across multiple criteria to derive meaningful results. As also indicated by the experiments conducted by Novikova et al. (cite), a new set of evaluation methodologies is needed for effectively evaluating LLM responses, since prior state-of-the-art metrics may not fit the task well. The same authors (cite) showed that automatic metrics can correlate well with human judgments at the system level while correlating poorly at the instance level. Related work (cite) also shows that different instances and different systems correlate better with different sets of metrics. Choosing a good set of evaluation metrics is therefore dataset-, system-, and instance-specific. These conclusions from prior state-of-the-art work motivated us to build a system that lets developers and LLM users experiment with datasets and software-related surveys, analyze which set of metrics best evaluates their prompts, and adjust those prompts to steer specific aspects in the direction the user needs. This project aims to enhance the evaluation of Large Language Models (LLMs) by providing a tool for experts to analyze responses and refine prompts. Users can measure aspects such as lexical diversity, readability, formality, n-gram overlap, and more. We are currently focusing on evaluating LLM responses for technical question-answering tasks; this approach can be expanded to include more metrics, datasets, and NLG tasks in the future.
1.1 Goals
• Propose a methodology to effectively evaluate LLM responses for the task of question-answering by involving a
human to assess and change results based on intent.
• Enable the user to adjust prompts to drive the LLM towards desirable results.
• Explore the applications and effectiveness of LLMs for empirical software engineering-based questions and compare
them with developer answers to judge human correlation.
• Choose an effective set of evaluation metrics that is dataset-, system-, and instance-specific.
Integration of Human Feedback During the Iterative Training Process: This concern is clearly addressed in the study by
Ouyang et al. [4], which discusses how to incorporate human feedback into the LLM training process. Their method of improving
LLM outputs through human evaluations emphasizes how important it is to include human judgment in iterative training
cycles to improve the accuracy and dependability of the model. This technique ensures that LLMs align more closely
with human expectations, improving their dependability in answering surveys and in comprehending and
responding to natural language inquiries. This is in line with the research topic of how LLM responses in natural language
understanding tasks are affected by human input during training, indicating improved performance through closer
alignment with human evaluation norms.
Responses to Software Engineering Problem-Solving Surveys: Gerosa et al. [2] draw attention to the use of LLMs such as
ChatGPT in producing artificially generated qualitative data for software engineering studies. This work shows how LLMs can
generate synthetic data to provide useful insights in sociotechnical domains, even though it does not directly compare tailored
prompt engineering techniques with zero-shot inferencing. The application of LLMs in this context implies that responses
may be more precise and correspond to particular terminology, like that used in software development, provided prompts
or in-context examples are designed accordingly. The research suggests the significance of question framing (prompt engineering) in
eliciting valuable responses from LLMs for difficult problem-solving in software engineering, even though it doesn't explicitly
address comparing various prompting strategies.
Metrics/Methodologies for Comparing LLMs' Responses to Human Responses: The study by Kamalloo et al. [3] offers a
thorough examination of open-domain QA models, such as LLMs, and goes beyond conventional lexical matching
metrics like F1 score and exact match (EM). Their strategy addresses the need for metrics that can capture semantic equivalence
and answer variation, in line with the research question on efficient methodologies for comparing LLM responses with human
responses. It involves manual assessment and the use of a BERT-based model (BEM) for supervision. The results of this
study highlight the shortcomings of conventional metrics and the importance of including contextual relevance and semantic
comprehension in assessment frameworks. Furthermore, the study by Hämäläinen et al. [5] highlights the need for validation
and the potential differences between LLM-generated and real responses, underscoring the difficulty of comparing synthetic
LLM responses with real human data. It also looks at the generation of synthetic data for HCI experiments.
Questionnaire Dataset: The dataset is designed for enhancing Large Language Models (LLMs) by incorporating demographic
data into software engineering scenarios. It includes structured questionnaire responses on demographic details such as
age, gender, ethnicity, education, and professional experience, all numerically encoded for analysis. Additionally, it captures
software engineers' preferences regarding development environments, learning methods, and language selection criteria.
Challenges such as work-life balance and keeping up with technological advancements are also addressed. The intent is to
use this data for 'prompt engineering,' enabling LLMs to simulate responses as if they were software engineers with specific
demographic profiles, thus creating personalized and contextually relevant interactions.
Profile Generation Methodology: Our study began with the generation of distinct user profiles, incorporating variables
such as age, gender, ethnicity, educational background, and work experience. For example, a profile could be an individual aged
18-22 years, identifying as male, of Asian ethnicity, holding a Bachelor's degree, and possessing less than a year of coding
experience. A comprehensive list of profiles was methodically created using the "product" function from Python's itertools
library, facilitating extensive coverage across a diverse demographic spectrum.
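A minimal sketch of this profile enumeration; the attribute values below are illustrative placeholders, and the actual survey categories may differ:

```python
from itertools import product

# Hypothetical attribute values purely for illustration.
ages = ["18-22", "23-26", "27-35"]
genders = ["male", "female"]
ethnicities = ["Asian", "Indian", "White"]
degrees = ["Bachelor's", "Master's"]
experience = ["<1 year", "1-3 years", "3-5 years"]

# itertools.product yields the Cartesian product: one tuple per
# combination of attribute values, i.e. one tuple per profile.
profiles = [
    dict(zip(["age", "gender", "ethnicity", "degree", "experience"], combo))
    for combo in product(ages, genders, ethnicities, degrees, experience)
]

print(len(profiles))  # 3 * 2 * 3 * 2 * 3 = 108 profiles
```

Because `product` enumerates every combination, the number of profiles grows multiplicatively with each attribute list, which is what gives the extensive demographic coverage described above.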
Figure 1: Relation between age, gender, and the biggest challenge faced as a software engineer
The data collected was unbiased with respect to work experience. The major outcome of the analysis was that early-career
engineers (0-5 years of work experience) find it difficult to allocate enough time for innovation while meeting deadlines,
whereas senior developers focus on both deadlines and innovation. Interestingly, the most seasoned professionals, with
over 10 years of experience, also allocate specific time for research and innovation, suggesting a consistent approach to
integrating innovative processes into their work schedules as they gain more experience.
Generation of Responses Using LLM: Utilizing the profiles generated, we crafted varied prompts, such as: "Imagine you
are aged 18-22 years, male, Asian, with a Bachelor's degree and less than a year of coding experience. Please respond to
the following questions-" This was followed by a sequence of questions and their options. The prompts used in generating the
responses are listed in the appendix below (cite).
Figure 2: Relation between ethnicity and challenge to achieve innovation or deadlines
Figure 3: Relation between work experience and challenges faced related to innovation vs. deadlines
Observations and Improvements in Response Generation: During the zero-shot inferencing process, the model exhibited
tendencies of hallucination, resulting in verbose and occasionally off-topic responses. This issue was notably mitigated by
introducing an explicit example into the prompt, guiding the LLM to adhere more closely to the expected response format. For
instance, we refined the prompt to include specific instructions on response format, such as: "Imagine you are a 23-26-year-old,
female, Asian, with a Master's degree and 3-5 years of coding experience. Respond to this question by selecting from the
given options only. For example, if asked about your preferred development environment, your response should be formatted
as follows: option_3: Linux." This adjustment led to a marked improvement in the relevance and precision of the LLM's
responses, addressing the second research question.
Data explanation: The dataset serves as a question-answering dataset specifically designed for scientific research papers. Each
question was formulated by an NLP practitioner who reviewed only the paper's title and abstract.
These questions are crafted to extract information found within the contents of the paper. Subsequently, a distinct group of
NLP practitioners answers the questions, providing gold references for our evaluation.
Data Preprocessing: The contents of the paper were passed to the LLM as context for answering the questions. The full contents
of a research paper are too verbose and exceed the token limit, so the "evidence" feature, which contains all paragraphs
relevant to the question from the full text, was used as context instead. The context was converted to lower case, punctuation was
removed, and the text was lemmatized to provide clean input to the LLM. The answers given by human expert annotators were
used as gold references for computing evaluation metrics.
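A sketch of this cleaning step. Lower-casing and punctuation removal are shown directly; lemmatization is left as a commented step because it requires an external resource (e.g. NLTK's WordNetLemmatizer plus the wordnet corpus), and the actual pipeline may use a different lemmatizer:

```python
import string

def clean_context(text: str) -> str:
    """Lower-case the evidence paragraphs and strip ASCII punctuation."""
    text = text.lower()
    # Remove all ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any runs of whitespace left behind.
    return " ".join(text.split())

# Optional lemmatization step (requires `nltk` and the wordnet data):
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()
# text = " ".join(lemmatizer.lemmatize(tok) for tok in text.split())

print(clean_context("Punctuation, like commas, is removed!"))
# punctuation like commas is removed
```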
To answer the third research question, we have used the following evaluation metrics so far:
n-gram overlap based metrics: BLEU variants, which measure the precision of word n-grams between generated and
reference texts; ROUGE variants, which measure recall of word n-grams and longest common subsequences; and METEOR, which is
based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. METEOR also incorporates
additional semantic matching based on stems and paraphrasing.
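A minimal sketch of the clipped n-gram precision at the core of these metrics (full BLEU additionally combines several n-gram orders and applies a brevity penalty; this shows only the core idea):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference,
    with counts clipped to the reference counts (as in BLEU)."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(cand)
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / len(cand)

# An exact extractive answer scores perfectly...
print(ngram_precision("220 human-human dialogs", "220 human-human dialogs"))  # 1.0
# ...while a paraphrased answer scores much lower, illustrating why
# n-gram metrics correlate poorly for derived (non-extractive) answers.
print(ngram_precision("the dataset has 220 dialogs", "220 human-human dialogs"))  # 0.4
```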
Embedding based distance metrics: BERTScore, which uses BERT embeddings to compute the cosine similarity between
the reference and candidate text embeddings.
Stylistic metrics: Readability is evaluated using the Flesch Kincaid Reading Ease Index, which measures the complexity of a
text based on sentence length and syllable count. The score indicates the approximate educational level a person needs
to read a particular text easily. Formality is a quantitative measure designed to assess the level of formality exhibited
in a generated summary. Formality in language refers to the degree of adherence to formal linguistic structures, conventions,
and vocabulary. This metric rewards more frequent usage of nouns, adjectives, articles, and prepositions over pronouns,
verbs, adverbs, and interjections. It also penalizes usage of deictic words like "I", "here", "now", etc., and instead rewards
specific references such as names, dates, and locations.
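The Flesch Reading Ease score is a fixed linear formula over word, sentence, and syllable counts. A sketch (the counts are taken as inputs here, since robust syllable counting needs a pronunciation dictionary or heuristic):

```python
def flesch_reading_ease(total_words: int, total_sentences: int,
                        total_syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text (roughly 0-100).
    Score = 206.835 - 1.015 * (words/sentence) - 84.6 * (syllables/word)."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Example: a text of 100 words in 10 sentences with 130 syllables
# scores about 86.7, i.e. easily readable.
print(flesch_reading_ease(100, 10, 130))
```

Longer sentences and longer words both drive the score down, which is why verbose hallucinated answers tend to show reduced readability.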
The formality score is computed from part-of-speech frequencies as:

F = (NNF + ADJF + PREPF + ARTF - PRONF - VBF - ADVF - INTF + 100) / 2

where NNF is noun frequency, ADJF is adjective frequency, PREPF is preposition frequency, ARTF is article frequency, PRONF is pronoun frequency, VBF is verb frequency, ADVF is adverb frequency, and INTF is interjection frequency.
The formality score varies from 0-100, with higher scores indicating higher formality.
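A sketch of this computation, assuming the part-of-speech frequencies (as percentages of all words) are already available from a POS tagger; the toy frequencies below are made up for illustration:

```python
def formality_score(freq: dict) -> float:
    """F-score from POS frequencies given as percentages of all words.
    Expected keys: noun, adjective, preposition, article,
    pronoun, verb, adverb, interjection."""
    positive = (freq["noun"] + freq["adjective"]
                + freq["preposition"] + freq["article"])
    negative = (freq["pronoun"] + freq["verb"]
                + freq["adverb"] + freq["interjection"])
    return (positive - negative + 100) / 2

# Toy frequencies (percentages summing to 100) purely for illustration.
freqs = {"noun": 28, "adjective": 10, "preposition": 12, "article": 10,
         "pronoun": 10, "verb": 20, "adverb": 8, "interjection": 2}
print(formality_score(freqs))  # (60 - 40 + 100) / 2 = 60.0
```

When the positive and negative frequency groups balance exactly, the score sits at the midpoint of 50; noun-heavy formal text pushes it toward 100.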
4 Milestones
4.1 Already done
1. Figure 1 Observations:
• Survey demographics include 61 participants with 44 males and 17 females.
• Early-career engineers (ages 18-26) primarily struggle with time management and staying updated with technology, whereas senior engineers face challenges with project deadlines and legacy codebases.
2. Figure 2 Observations:
• Ethnic distribution is skewed towards Asians (45) and Indians (4) out of 61 respondents.
• Across demographics, particularly among females, keeping pace with technology and managing time for
innovation are significant challenges. Whites and Pacific Islanders additionally indicate difficulties in work-life
balance and codebase understanding.
3. Figure 3 Observations:
• Work experience in the dataset shows no bias and spans across different career stages.
• Engineers with 0-5 years of experience mainly report time management challenges for innovation and meeting
deadlines. More seasoned professionals, with over 10 years of experience, consistently allocate time for research
and innovation.
4. Figure 4 Observations:
• Responses based on education level show that Bachelor’s degree holders often seek help from knowledgeable
colleagues. Master’s degree holders tend to learn new languages beforehand, while professional degree holders
prefer adapting on the fly. Reallocating the project is the least popular approach.
5. Figure 5 Observations:
• The preferred development environment among the majority is Windows, especially for those aged 23-26. macOS
is the second choice, favored by respondents aged 27-35. Linux is consistently chosen by individuals between
23-35 years old, and one respondent aged 47 is comfortable with both macOS and Linux.
LLM Used: To generate responses to our question set, a predefined list of dictionaries holding each question
and its associated options as key-value pairs, we employed OpenAI's 'gpt-3.5-turbo' model, operating under a zero-shot
inferencing framework. This approach was pivotal in assessing the model's capability to generate relevant responses without
prior specific training on the questionnaire context.
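A sketch of this setup. The question dictionary below is a hypothetical instance of the format described above, and `build_prompt` is an illustrative helper; the API call is shown with the standard `openai` chat-completions client but left unexecuted, since it requires an API key, and the exact prompts used are those in the appendix:

```python
# Hypothetical question entry mirroring the "list of dictionaries" format.
questions = [
    {"question": "What is your preferred development environment?",
     "options": {"option_1": "Windows", "option_2": "macOS", "option_3": "Linux"}},
]

profile = ("Imagine you are aged 18-22 years, male, Asian, with a Bachelor's "
           "degree and less than a year of coding experience.")

def build_prompt(profile: str, q: dict) -> str:
    """Concatenate the persona, the question, and its options into one prompt."""
    opts = "\n".join(f"{key}: {val}" for key, val in q["options"].items())
    return (f"{profile} Please respond to the following question-\n"
            f"{q['question']}\n{opts}")

prompt = build_prompt(profile, questions[0])
print(prompt)

# Zero-shot call (requires an API key; sketched, not executed here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# )
# print(resp.choices[0].message.content)
```

Zero-shot here means the prompt carries only the persona and the question; the few-shot variant discussed earlier would append a formatted example answer (e.g. "option_3: Linux") to the same prompt string.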
• When comparing zero-shot and few-shot inferencing, zero-shot responses often include extra text. For instance, when
asked about the size of the ANTISCAM dataset, the zero-shot response was “The ANTISCAM dataset consists of
220 human-human dialogs,” while the more concise few-shot response was “220 human-human dialogs.” This shows
that the few-shot strategy adheres more closely to our prompt for brief, exact answers.
• The metric analysis showed that n-gram overlap-based metrics capture the quality of responses very effectively when
the answer is exactly extracted from the context provided. When the LLM answer is something derived from the
context and not an exact match, the automatic metrics show less correlation with humans.
• Non-reference and non-n-gram overlap metrics are effective for free text evaluation, but automatic metrics remain
crucial for fact-based answers that are directly extracted from the context. Further analysis will refine the metric set.
• Stylistic-based metrics can be ignored for exact answers from the context as they are concise, but they are very critical
to recognize hallucinated outputs as the readability and formality decrease with verbose answers. These metrics are
useful for free text answers not directly found in the context.
• Human responses were gathered via Google Forms, and these were used to craft tailored prompts, incorporating
demographic data to simulate responses as though they were produced by a Large Language Model (LLM) in
place of actual participants. To decipher the relationship between demographic factors and the responses provided by
both humans and the LLM, visual analyses were conducted using bar graphs among other methods. Observations
from these analyses have shed light on our central research question: the feasibility of employing LLMs as alternatives
to conventional survey data collection methods.
Our survey data is biased in certain aspects such as ethnicity, gender, and age (Asian and Indian respondents dominate the sample). To tackle this problem, we plan to circulate the survey in various other, more diverse communities, which will help us gather better quality data, and to revise the prompts currently used on the LLMs accordingly.
• Week 1: Designing a questionnaire, conducting a further literature review, and making a video presentation.
• Week 2: Collecting data and data pre-processing; designing LLM inputs.
• Week 3:
This timeline leaves room for the inclusion or exclusion of several ideas that we expect to encounter as the project
takes shape.
References
[1] Ellen Jiang et al. “Promptmaker: Prompt-based prototyping with large language models”. In: CHI Conference on Human
Factors in Computing Systems Extended Abstracts. 2022, pp. 1–8.
[2] Marco Gerosa et al. “Can AI serve as a substitute for human subjects in software engineering research?” In: Automated
Software Engineering 31.1 (2024), p. 13.
[3] Ehsan Kamalloo et al. “Evaluating Open-Domain Question Answering in the Era of Large Language Models”. In: arXiv
preprint arXiv:2305.06984 (2023).
[4] Long Ouyang et al. “Training language models to follow instructions with human feedback”. In: Advances in Neural
Information Processing Systems 35 (2022), pp. 27730–27744.
[5] Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. “Evaluating Large Language Models in Generating Synthetic HCI
Research Data: a Case Study”. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
CHI ’23. Hamburg, Germany: Association for Computing Machinery, 2023. ISBN: 9781450394215. DOI: 10.1145/3544548.3580688.
URL: https://doi.org/10.1145/3544548.3580688.
[6] Anna-Marie Ortloff et al. “Different Researchers, Different Results? Analyzing the Influence of Researcher Experience
and Data Type During Qualitative Analysis of an Interview and Survey Study on Security Advice”. In: Proceedings of
the 2023 CHI Conference on Human Factors in Computing Systems. 2023, pp. 1–21.