GPT Cps Annotator Submit
Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi and Lei Liu
Abstract
Coding the communication data from collaborative problem solving (CPS) tasks is central to assessing CPS. This paper reports the findings on using ChatGPT to directly code CPS chat data
by benchmarking performance across multiple datasets and coding frameworks. We found that
ChatGPT-based coding outperformed human coding in tasks where the discussions were
characterized by colloquial language but fell short in tasks where the discussions dealt with
specialized scientific terminology and contexts. The findings offer practical guidelines for
researchers to develop strategies for efficient and scalable analysis of communication data from
CPS tasks.
Email: jhao@ets.org
Introduction
Analyzing verbal communication from collaborative problem solving (CPS) is crucial for unraveling the intricacies of group interactions, learning dynamics, and knowledge
construction processes. Verbal communication can take various forms, including text chat or
audio interactions. With the rapid advancement of AI technology, audio can be automatically transcribed into text with high accuracy (Radford et al., 2023). Therefore, the
primary focus for analyzing the communication data is around coding text chats, which serves as
a rich source of insights into collaborative interactions and cognitive processes. Historically, this
analysis has leaned heavily on the painstaking efforts of human coders, who have meticulously
applied detailed coding frameworks to categorize and interpret the data. Once a portion of the
data has been reliably human-coded (e.g., 10 - 25% of the total dataset; Campbell et al., 2013),
automated annotation methods can be developed using numerical representations of the chats
(e.g., n-grams or neural embeddings) and supervised machine learning classifiers (Flor et al.,
2016; Hao et al., 2017; Moldovan, Rus, & Graesser, 2011; Rosé et al., 2008). Nevertheless, this
pipeline hinges on high-quality human-coded data, which, despite constituting only a fraction of
the total dataset, remains a time-consuming and labor-intensive part of the process. This inherent
limitation significantly constrains the scope and scalability of research projects in this domain.
As the volume of digital communication data continues to grow, there is a pressing need
for more efficient methods that can handle large datasets without compromising the depth and
processing have opened new possibilities in this domain, with ChatGPT emerging as a
particularly promising tool (OpenAI, 2023). Unlike previous automated coding methods, which
required a substantial amount of human-coded data to train machine learning models, ChatGPT can be instructed to apply various coding frameworks to code chats directly, bypassing the need for extensive human involvement in coding. This application falls under the category
known as zero-shot or few-shot classification in natural language processing (NLP). While the
concept is straightforward, the true measure of its effectiveness lies in its performance in real-
world scenarios, which is intricately tied to the complexity of the dataset and the underlying
coding framework employed. For straightforward tasks like sentiment analysis, studies have
indicated that ChatGPT generally exhibits good accuracy, albeit with some variance across
datasets (Belal et al., 2023; Fatouros et al., 2023). However, when confronted with more
complex coding challenges, ChatGPT often falls short of meeting expectations (Kocoń et al.,
2023).
This paper presents an empirical study aimed at assessing the effectiveness of employing
ChatGPT for directly coding CPS chat data. By conducting an evaluation of its coding
performance across four datasets and two coding frameworks, our results demonstrate that
ChatGPT, when prompted properly, can achieve coding quality comparable to that of human
coders for CPS tasks where there is not much technical context knowledge. The empirical
findings highlight that harnessing ChatGPT for coding text chat data in some CPS scenarios is
not only viable but also significantly reduces costs and time constraints, thereby opening up
promising avenues for accelerating and expanding research in fields reliant on the analysis of
communication data.
Methodology
In this section, we introduce the details about the data, coding frameworks, large language models, and prompt design.
Data
The chat data utilized in this study were from four distinct collaborative problem-solving
tasks administered through the ETS Platform for Collaborative Assessment and Learning
(EPCAL, Hao et al., 2017). The first dataset stemmed from dyadic online collaborative sessions
centered on science tasks. These tasks encompass ten multiple-choice and constructed response
items, alongside a simulation-based science task (Hao et al., 2015). The second through the
fourth datasets were derived from three different types of tasks (negotiation, decision making, and problem solving) administered to 4-person teams (Kyllonen et al., 2023). Figure 1 shows the
screenshots of these tasks. All these chat data were collected through crowdsourcing, e.g., from
Amazon Mechanical Turk (https://www.mturk.com/) for the science task and Prolific
(https://www.prolific.com/) for the remaining three tasks. In this study, we chose several teams' collaborative sessions, totaling about 1,500 chat turns for each task. These chats had all been coded by human coders.
The demographic distribution of the dataset sample was as follows. In the science task
dataset, 51% were female, 80% identified as White, 10% Asian, 5% Black or African American,
and 5% Hispanic or Latino. In the negotiation task sample, 42% were female, 69% identified as
White, 8% Black, 7% Asian, 8% Mixed, and 7% other or no response. The problem-solving task
sample was 64% female, 67% identified as White, 13% Black, 9% Asian, 6% Mixed, and 5%
other or no response. In the decision-making task sample, 62% were female, 81% identified as
Figure 1
(c) Problem solving task (d) Decision making task
Coding Framework
Two coding frameworks were considered in this study. The first coding framework (Liu
et al., 2016) was built off an extensive review of computer supported collaborative learning
(CSCL) research findings (Barron, 2003; Dillenbourg & Traum, 2006; Griffin & Care, 2014), as
well as the PISA 2015 CPS Framework (Graesser & Foltz, 2013). The coding scheme includes
four major categories designed to assess both social and cognitive processes within group
interactions. The first category, denoted as “Sharing”, captures instances of how individual group
members introduce diverse ideas into collaborative discussions. For example, participants may
share their individual responses to assessment items or highlight pertinent resources that may aid the task. The second category, "Negotiating", captures how group members respond to one another's ideas; such responses include agreement/disagreement, requesting clarification, elaborating, or rephrasing others' ideas, identifying cognitive gaps, and revising ideas. The third category, "Regulating", focuses on the collaborative regulation aspect of team discourse, including activities like identifying team goals, evaluating teamwork, and checking understanding. The fourth category, "Maintaining", captures chats that maintain communication, such as greetings and off-task talk. We applied this coding framework to code the chat data from the science task.
The second coding framework combined the Liu et al. (2016) framework with one developed for conversations across three task types (negotiation, problem solving, decision making) administered in Kyllonen et al. (2023). The goal in developing this second coding framework was to produce a set of categories that could be interpreted comparably across the three diverse tasks, which would allow us to sensibly explore the generalizability versus task-specificity of collaborative skills. The framework drew on categories and their definitions from Liu et al. (2016) and Martin-Raugh et al. (2020) but was refined through the experiences of five human annotators, who coded a subset of conversations independently and then iteratively refined category definitions and resolved coding disagreements until a consensus on category definitions and the treatment of particular cases emerged.
The final framework comprised 5 top-level categories, each of which is associated with
two to five subcategories, leading to 17 subcategories in total. For this study we only consider
the 5 top-level categories. The categories were (a) maintaining communication (MC) (greetings,
emotional responses, technical discussions, and other), (b) staying on track (OT) (keep things
moving/monitoring time, steering team effort), (c) eliciting information (EI) (eliciting
information from another about the task; eliciting strategies, eliciting goals or opinions), (d)
sharing information (SI) (sharing information, sharing strategies, sharing goals or opinions), and (e) acknowledging (stating agreement or accepting a tradeoff, stating disagreement or rejecting a tradeoff, building off one's own or a teammate's input).
In Table 1 and 2, we show some example chats from the science and decision-making
tasks together with their corresponding codes by human coders using the corresponding coding
framework.
Table 1
Example Chats from the Science task and the Corresponding Skill Categories Under the First
Coding Framework.
Team Member | Chat | Skill Category
Person A | hi | Maintain
Person B | hey | Maintain
Person B | Wait you're a real person? | Maintain
Person A | haha yea, i guess its just a coincidence | Maintain
Person A | Welp, i guess its gonna be a good day | Maintain
Person B | Easy 5 dollars | Maintain
Person A | so i put that they were both moving | Share
Person B | It's either C or A | Share
Person B | It's A | Share
Person A | Sounds good to me, i think molecules are always moving | Negotiate
Person B | Yeah | Maintain
Person B | Definitly B | Share
Person A | Well i think it depends on the temperature of the water | Negotiate
Person A | if the water is all one temp then they are moving at the same speed, or am i wrong | Negotiate
Person B | I'm not sure. | Regulate
Person A | wanna just go with c | Regulate
Table 2
Example Chats from the Decision-making Task and the Corresponding Skill Categories Under the Second Coding Framework.

Team Member | Chat | Skill Category
Person A | What does everyone say? | Elicit
Person B | Avenue A | Share
Person C | B: several billboard ads arounds town, no clean up crew provided, full time staff known to be surly | Share
Person A | Yuck! | Maintain
Person A | I have ACB so far? | Elicit
Person C | c: option to include live local music bands at a low cost, equipped to host weddings and birthdays, owners like 10 mins from site | Share
Person A | For A, it says that there can be multiple events going on at the same time in close proximity to one another | Share
Person C | ohh thats a turn off | Share
Person A | For C, patrons have to provide their own beverages, but that sounds helpful (cheaper for BYOB) | Share
Person B | I think safety and security should be a priority here | Share
Person A | B has poor lighting in the parking lot and along outdoor walkways AND customers must contribute to the cost of liability insurance | Share
Person A | 2 minutes left | On Track
Person A | ACB? or ??? | Elicit
Person A | B sounds good for food | Share
Person B | ACB would be best option | Share
Person A | That's what I had | Acknowledge
ChatGPT Models
We chose the OpenAI GPT models deployed on Azure cloud for our study because they
are widely recognized as the most capable large language models. Specifically, we employed the
snapshot models, GPT-3.5-Turbo (version 0613) and GPT-4 (version 0613), configuring the
model temperature to zero and utilizing a fixed random seed for repeatability and consistency.
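A minimal sketch of the request configuration described above, assuming the chat-completions payload format used by Azure-hosted OpenAI models; the deployment name, seed value, and prompt text here are illustrative, not the study's actual values.

```python
# Sketch of a deterministic chat-completions payload. The deployment name,
# seed value, and prompt text are illustrative assumptions.

def build_coding_request(coding_prompt: str, deployment: str = "gpt-4-0613") -> dict:
    """Assemble a request configured for repeatability: temperature 0
    (greedy decoding) and a fixed random seed."""
    return {
        "model": deployment,   # snapshot model, e.g., GPT-4 (version 0613)
        "temperature": 0,      # suppress sampling randomness
        "seed": 42,            # fixed seed for consistency across runs
        "messages": [{"role": "user", "content": coding_prompt}],
    }

payload = build_coding_request("Assign a category number to each chat turn.")
```

Pinning a snapshot model version, zero temperature, and a fixed seed together is what makes runs approximately reproducible; as noted in the Discussion, even this configuration does not guarantee byte-identical outputs.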
Prompt Design
The art of crafting effective prompts for Large Language Models (LLMs) represents a
nuanced intersection of creativity and technical acumen (Brown et al., 2020). It has been shown
that relevant domain knowledge is essential to create desirable prompts (Zamfirescu-Pereira et
al., 2023), and an iterative refinement process is crucial for leveraging the sophisticated
capabilities of LLMs. As a result, prompt engineering has emerged as a new discipline in which practitioners craft prompts to elicit optimal responses from AI models (Radford et al., 2019). On the other hand, when approaching a specific task, one must consider
the virtually limitless possibilities for prompt design. The efficacy of AI in performing tasks
hinges on the sophistication and detail of these prompts. For instance, in an extreme scenario, AI
could simply replicate a solution generated by a human through the prompt. Therefore,
establishing a benchmark for prompt complexity is crucial to ensuring the work is predominantly
from the AI's capabilities rather than the intricacy of human-generated prompts. While setting a
standard for prompt detail is challenging, one guideline could be that the effort spent by a human
in formulating the prompts should not surpass the effort it would take for the person to complete
the task without using AI. In our study, we went through a thorough prompt-engineering
exercise, adhering to best practice recommendations (OpenAI, n.d.), to craft prompts that
directed the AI models to code the chat messages effectively without overly specifying them. Once the prompts were finalized, we asked ChatGPT to code the chat data directly. The example prompt we used is provided in the Appendix.
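The structure of the prompt shown in the Appendix (category definitions followed by numbered chat turns) can be assembled programmatically. The helper below is our own sketch, with abbreviated category texts standing in for the full definitions.

```python
# Our own sketch of assembling a coding prompt in the style of the Appendix:
# category definitions first, then the numbered chat turns to be coded.

def build_prompt(categories: dict, turns: list) -> str:
    parts = [f"Category {num}: {desc}" for num, desc in sorted(categories.items())]
    parts.append("Below are the turns of chats. Please assign a category "
                 "number to each of them and return the coding list:")
    parts.extend(f"{i}. {turn}" for i, turn in enumerate(turns, 1))
    return "\n".join(parts)

prompt = build_prompt(
    {1: "Chats that help maintain communication.",
     4: "Chats that contribute details for working through the task."},
    ["hi", "Utilities are included in A"],
)
```

Numbering the turns explicitly makes it easy to align the returned coding list with the turns that were sent.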
Results
The results of the coding are presented in Table 3. Overall, GPT-3.5 delivered coding agreement inferior to that between human coders across the tasks. So, in the following, our discussion focuses on the results from GPT-4.
Table 3
Performance of the ChatGPT coder on four datasets. Each dataset contains about 1,500 turns of chat.
For the negotiation, problem solving, and decision-making tasks, which shared the same second coding framework, we observed that GPT-4 exhibited a higher level of agreement with human coders than the human coders exhibited with each other. Coding agreement varied by task, and the trends for ChatGPT coding and human coding were similar: where the human-human coding agreement for a task was low, the corresponding human-ChatGPT coding agreement was low too. The coding agreement for the negotiation task was the lowest among the three tasks under the second coding framework for both human coders and the GPT-4-based coder, which indicates that the contents of the chat data have a direct impact on coding quality.
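Agreement between two coders can be quantified with a chance-corrected statistic; the paper does not specify which statistic was used, so the sketch below assumes Cohen's kappa as one common choice for comparing human-human and human-ChatGPT agreement over the same chat turns.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders on the same chat turns.
    Note: undefined (division by zero) if expected agreement equals 1."""
    n = len(codes_a)
    # Observed proportion of turns on which the two coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement by chance, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Running this once on human-vs-human codes and once on human-vs-GPT-4 codes yields the kind of comparison summarized in Table 3.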
For the science task under coding framework one, the agreement between GPT-4 and human coders was much lower than that between human coders. This indicates that there is something human coders can pick up properly that GPT-4 does not. Because the coding of the
chat from the negotiation task already showed that the chat contents affect the coding quality, it
is likely that some peculiarities of the chat contents from the science task are what is responsible
for the low coding quality of GPT-4. To get deeper insight into the chat contents, we created a
word cloud for the nouns in the chats from the science task. As a comparison, we also created a
word cloud based on the nouns from chats of the decision-making task, where GPT-4 showed the highest agreement with human coding. The results are shown in Figure 2.
We observed a distinct contrast in the lexical composition between the chats from the
science task and those arising from the decision-making task. Chats from the science task
predominantly featured scientific terms such as molecules, air, water, mass, and energy,
indicative of a specialized discussion in the science domain. Conversely, chats originating from the
decision-making task exhibited a prevalence of commonplace terms like venue, apartment, rent,
and information, reflecting a more colloquial domain. This observed discrepancy in linguistic
patterns underscores potential disparities in the performance of GPT-4 coding. It is possible that
the relative scarcity of scientific terms in the training data of the GPT-4 model may yield less
refined vector representations, thereby impinging upon the overall coding accuracy, which is
consistent with the decreased performance of ChatGPT in some technical areas (OpenAI, 2023).
Confirmation of this hypothesis requires more extensive comparisons with other datasets, which we leave for future work.
Figure 2
Word clouds of the nouns in the chats from science and decision-making tasks. The top panel is
from the science task and the bottom panel is from the decision-making task.
Discussion
Our empirical study shows that instructing ChatGPT to code the chats from CPS tasks
into predefined categories can achieve quality comparable to human coders for tasks that do not
involve much technical and scientific discussions. This finding establishes that using ChatGPT to
code chat data from CPS activities is feasible and reliable for some applications. By doing so,
one can substantially reduce both time and costs, opening a new avenue for scalable analysis of
communication data. Meanwhile, we identified several noteworthy issues during the process.
Firstly, each LLM has a maximum context window. The versions of GPT-3.5-Turbo and GPT-4 we used have maximum context windows of 4,096 and 8,192 tokens, respectively. Thus, we generally cannot put all the chat turns into a single API call. Instead, we used batches of about 70 chat turns, each corresponding to one collaborative session, per API call. Secondly, we observed that the use of batches serves not only to meet the context window limit but also to ensure more reliable coding. We noticed that if the number of chat turns was high, e.g., more than 200, (a) it would take much longer to get the results back from the API and (b) there would be a higher chance of a mismatch between the number of chat turns sent and the number of coded chat turns returned. Thirdly, we observed that even when we set the temperature to zero and used the same random seed, calling the API to code the same dataset in different runs could still yield some small changes in the coding, though the agreement statistics remained
almost unchanged across the runs. We also verified that averaging the results from multiple runs can mitigate this run-to-run variability.
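The batching and sanity checks described above can be sketched as follows; the batch size of 70 turns follows the text, and the function names are our own.

```python
# Sketch of the batching and count-validation logic described above.

def make_batches(turns, batch_size=70):
    """Split chat turns into batches small enough for the context window."""
    return [turns[i:i + batch_size] for i in range(0, len(turns), batch_size)]

def counts_match(batch, returned_codes):
    """Detect the mismatch mode noted above: the model must return exactly
    one code per chat turn sent, or the batch should be re-submitted."""
    return len(returned_codes) == len(batch)

batches = make_batches([f"turn {i}" for i in range(150)])
```

Keeping batches at roughly one collaborative session each both respects the context window and, per our observations, makes the turn-to-code alignment more reliable.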
Despite the great promise of directly coding chat turns using ChatGPT, we are also aware
of some limitations of the current study. Firstly, the best coding results we obtained using GPT-4
are not necessarily the upper limit of the coding accuracy as better prompt engineering could
lead to more accurate coding. Secondly, our experiments were conducted using the snapshot
versions of GPT-3.5-Turbo and GPT-4. As these LLMs are constantly evolving and more capable ones, such as Gemini and Claude 3, appear, our finding that the coding performance decreases for the science task could change with more powerful LLMs. Finally, the coding frameworks
we experimented with are relatively simple, and it is essential for users to conduct thorough
benchmark studies when considering more sophisticated coding frameworks before relying on
ChatGPT to code their data. Nevertheless, the insights gleaned from this study offer valuable
evidence to guide researchers in making informed decisions about leveraging ChatGPT for coding communication data from CPS tasks.
Acknowledgement
This work was funded in part by the U.S. Army Research Institute for the Behavioral and Social
Sciences (ARI) under Grant # W911NF-19-1-0106, Education Innovation Research (EIR) under
Appendix
Category 1: Chats that help maintain communication. Such as greeting and pre-
task chit-chat, emphatic expression of an emotion or feeling, chat turns that
deal with technical issues, and chat turns that don’t fit into other coding
categories (generally off-topic or typos). Below is a list of example chats
in this category:
["Hi, I'm Jake",
"where are you from?",
"hi, can you see this?”,
"lol",
";-)",
"I’m terrible at this."]
Category 2: Chats that help keep the team on track. Such as keeping things
moving, monitoring the remaining time, and steering the team effort.
Below is a list of example chats in this category:
["2 minutes left"]
Category 3: Chats that ask for input on the task. Such as asking for
information related to the task, asking for task strategies (regardless of
phrasing, no strategy proposed), and asking for task-related goals or
opinions. These chats are usually questions and end with a '?'.
Below is a list of example chats in this category:
["What did you put for best?",
"Do you know what any of the letters are?",
"what about where we'll present?",
"How do I do this?",
"What did you try?",
"Should I add J+J?",
"Why did you put A for best?"]
Category 4: Chats that contribute details for working through the task. Such
as sharing information related to the task, sharing strategies for solving
the task, and sharing task-related goals or opinions.
Below is a list of example chats in this category:
["Utilities are included in A",
"C=3",
"It's alright, we still know A is 5",
"Let's try adding three numbers",
"If A is 1 then B must be 2",
"If utilities are included, it means the rent is higher",
"I had BCA"]
Category 5: Chats that acknowledge partner(s) input and may continue off that
input with their own. Such as neutrally acknowledging a partner's statement,
agreeing with or supporting a partner's statement, disagreeing with a
partner's statement, adding details to a previously made chat turn (their own
or another player's), and suggesting a give-and-take solution (e.g., if you
give me this, I'll give you that; this may simultaneously reject an option on
the table and propose a compromise).
Below is a list of example chats in this category:
["okay, thanks",
"okay","OK","ok","Ok",
"thanks",
"Done",
"we did it!",
"we nailed it",
"we got all!",
"Good",
"You make a fair point",
"i'll try it",
"Oh I see. That's a good reason."]
Below are the turns of chats. Please assign a category number to each of them
and return the coding list:
1. I think that all atoms are in movement at all time even if the object is
solid
2. yeah that sounds correct
3. ok
4. …
References
Barron, B. (2003). When smart groups fail. The Journal of the Learning Sciences, 12, 307-359.
Belal, M., She, J., & Wong, S. (2023). Leveraging ChatGPT as text annotation tool for sentiment analysis. arXiv preprint.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
Dillenbourg, P., & Traum, D. (2006). Sharing solutions: Persistence and grounding in multimodal collaborative problem solving. The Journal of the Learning Sciences, 15(1), 121-151.
Fatouros, G., Soldatos, J., Kouroumali, K., Makridis, G., & Kyriazis, D. (2023). Transforming
sentiment analysis in the financial domain with ChatGPT. Machine Learning with Applications.
Flor, M., Yoon, S. Y., Hao, J., Liu, L., & von Davier, A. (2016, June). Automated classification of collaborative problem solving interactions in simulated science tasks. In Proceedings of the 11th workshop on innovative use of NLP for building educational applications (pp. 31-41).
Graesser, A., & Foltz, P. (2013). The PISA 2015 Collaborative Problem Solving Framework.
Griffin, P., & Care, E. (Eds.). (2014). Assessment and teaching of 21st century skills: Methods and approach. Springer.
Hao, J., Chen, L., Flor, M., Liu, L., & von Davier, A. A. (2017). CPS-Rater: Automated sequential annotation for conversations in collaborative problem-solving activities. ETS Research Report Series.
Hao, J., Liu, L., von Davier, A. A., Lederer, N., Zapata‐Rivera, D., Jakl, P., & Bakkenson, M.
(2017). EPCAL: ETS platform for collaborative assessment and learning. ETS Research Report Series.
Hugging Face. (n.d.). LLM prompting guide. Retrieved February 29, 2024, from
https://huggingface.co/docs/transformers/en/tasks/prompting
Kocoń, J., Cichecki, I., Kaszyca, O., Kochanek, M., Szydło, D., Baran, J., ... & Kazienko, P.
(2023). ChatGPT: Jack of all trades, master of none. Information Fusion, 99, 101861.
Kyllonen, P., Hao, J., Weeks, J., Kerzabi, E., Wang, Y., & Lawless, R. (2023). Assessing
Liu, L., Hao, J., von Davier, A. A., Kyllonen, P., & Zapata-Rivera, J. D. (2016). A tough nut to crack: Measuring collaborative problem solving. In Handbook of research on technology tools for real-world skill development (pp. 344-359). IGI Global.
Martin-Raugh, M. P., Kyllonen, P. C., Hao, J., Bacall, A., Becker, D., Kurzum, C., et al. (2020). Negotiation as an interpersonal skill: Generalizability of negotiation outcomes and tactics across contexts at the individual and collective levels. Computers in Human Behavior.
Moldovan, C., Rus, V., & Graesser, A. C. (2011). Automated speech act classification for online chat. In Proceedings of the 22nd Midwest Artificial Intelligence and Cognitive Science Conference.
OpenAI. (n.d.). Best practices for prompt engineering with the OpenAI API. OpenAI Help Center. https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR.
Rosé, C. P., Wang, Y. C., Cui, Y., Arguello, J., Stegmann, K., Weinberger, A., & Fischer, F. (2008). Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning. International Journal of Computer-Supported Collaborative Learning, 3(3), 237-271. doi:10.1007/s11412-008-9044-7
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., ... & Schmidt, D. C. (2023). A
prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint
arXiv:2302.11382.
Zamfirescu-Pereira, J. D., Wong, R. Y., Hartmann, B., & Yang, Q. (2023, April). Why Johnny
can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings
of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1-21).