Scaling up the Evaluation of Collaborative Problem Solving: Promises and Challenges of

Annotating Chat Data with ChatGPT

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi and Lei Liu

Educational Testing Service, Princeton, NJ 08541, USA

Abstract

Collaborative problem solving (CPS) is widely recognized as a critical 21st century skill.

Efficiently coding communication data is an outstanding challenge in scaling up research on

assessing CPS. This paper reports the findings on using ChatGPT to directly code CPS chat data

by benchmarking performance across multiple datasets and coding frameworks. We found that

ChatGPT-based coding outperformed human coding in tasks where the discussions were

characterized by colloquial language but fell short in tasks where the discussions dealt with

specialized scientific terminology and contexts. The findings offer practical guidelines for

researchers to develop strategies for efficient and scalable analysis of communication data from

CPS tasks.

Keywords: ChatGPT, Coding, Chat, Collaborative Problem Solving


Email: jhao@ets.org

Introduction

The analysis of verbal communication data in collaborative problem solving (CPS) is

crucial for unraveling the intricacies of group interactions, learning dynamics, and knowledge

construction processes. Verbal communication can take various forms, including text chat or

audio interactions. With the rapid advancement of AI technology, audio can be automatically
transcribed into text with high accuracy (Radford et al., 2023). Therefore, the primary focus in
analyzing communication data is on coding text chats, which serve as a rich source of insights
into collaborative interactions and cognitive processes. Historically, this

analysis has leaned heavily on the painstaking efforts of human coders, who have meticulously

applied detailed coding frameworks to categorize and interpret the data. Once a portion of the

data has been reliably human-coded (e.g., 10 - 25% of the total dataset; Campbell et al., 2013),

automated annotation methods can be developed using numerical representations of the chats

(e.g., n-grams or neural embeddings) and supervised machine learning classifiers (Flor et al.,

2016; Hao et al., 2017; Moldovan, Rus, & Graesser, 2011; Rosé et al., 2008). Nevertheless, this

pipeline hinges on high-quality human-coded data, which, despite constituting only a fraction of

the total dataset, remains a time-consuming and labor-intensive part of the process. This inherent

limitation significantly constrains the scope and scalability of research projects in this domain.
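To make this traditional pipeline concrete, the sketch below trains a simple n-gram classifier on a human-coded fraction of chats and applies it to uncoded turns. It uses scikit-learn with toy, hypothetical chat turns and labels; the cited studies used their own features and classifiers.

# A toy illustration of the supervised pipeline sketched above: train a
# classifier on n-gram features of the human-coded fraction of chats, then
# label the remaining turns automatically. Chat turns and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

coded_chats = ["hi there", "i put B for question 2", "do you agree with that?"]
coded_labels = ["Maintain", "Share", "Negotiate"]
uncoded_chats = ["sounds good to me", "let's go with C"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram and bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(coded_chats, coded_labels)
print(model.predict(uncoded_chats))        # predicted codes for the uncoded turns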

As the volume of digital communication data continues to grow, there is a pressing need

for more efficient methods that can handle large datasets without compromising the depth and

quality of analysis. Recent advancements in artificial intelligence and natural language

processing have opened new possibilities in this domain, with ChatGPT emerging as a

particularly promising tool (OpenAI, 2023). Unlike previous automated coding methods, which

required a substantial amount of human-coded data to train machine learning models, ChatGPT

can be instructed to apply various coding frameworks to code chats directly, bypassing

the need for extensive human involvement in coding. This application falls under the category

known as zero-shot or few-shot classification in natural language processing (NLP). While the

concept is straightforward, the true measure of its effectiveness lies in its performance in real-

world scenarios, which is intricately tied to the complexity of the dataset and the underlying

coding framework employed. For straightforward tasks like sentiment analysis, studies have

indicated that ChatGPT generally exhibits good accuracy, albeit with some variance across

datasets (Belal et al., 2023; Fatouros et al., 2023). However, when confronted with more

complex coding challenges, ChatGPT often falls short of meeting expectations (Kocoń et al.,

2023).

This paper presents an empirical study aimed at assessing the effectiveness of employing

ChatGPT for directly coding CPS chat data. By conducting an evaluation of its coding

performance across four datasets and two coding frameworks, our results demonstrate that

ChatGPT, when prompted properly, can achieve coding quality comparable to that of human

coders for CPS tasks that do not require much technical context knowledge. The empirical

findings highlight that harnessing ChatGPT for coding text chat data in some CPS scenarios is

not only viable but also significantly reduces costs and time constraints, thereby opening up

promising avenues for accelerating and expanding research in fields reliant on the analysis of

communication data.

Methodology

In this section, we describe the data, coding frameworks, large language
models, and the prompt design used in this study.

Data

The chat data utilized in this study were from four distinct collaborative problem-solving

tasks administered through the ETS Platform for Collaborative Assessment and Learning

(EPCAL, Hao et al., 2017). The first dataset stemmed from dyadic online collaborative sessions

centered on science tasks. These tasks encompassed ten multiple-choice and constructed-response

items, alongside a simulation-based science task (Hao et al., 2015). The second through the

fourth datasets were derived from three different types of tasks (negotiation, decision making, and
problem solving) administered to four-person teams (Kyllonen et al., 2023). Figure 1 shows the

screenshots of these tasks. All these chat data were collected through crowdsourcing, e.g., from

Amazon Mechanical Turk (https://www.mturk.com/) for the science task and Prolific

(https://www.prolific.com/) for the remaining three tasks. In this study, we selected several teams'
collaborative sessions totaling about 1,500 chat turns for each task. All these chats were
double coded by two human coders based on the two coding frameworks described below.

The demographic distribution of the dataset sample was as follows. In the science task

dataset, 51% were female, 80% identified as White, 10% Asian, 5% Black or African American,

and 5% Hispanic or Latino. In the negotiation task sample, 42% were female, 69% identified as

White, 8% Black, 7% Asian, 8% Mixed, and 7% other or no response. The problem-solving task

sample was 64% female, 67% identified as White, 13% Black, 9% Asian, 6% Mixed, and 5%

other or no response. In the decision-making task sample, 62% were female, 81% identified as

White, 6% Black, 6% Asian, 3% Mixed, and 4% other or no response.

Figure 1

Collaborative Tasks Corresponding to the Data Used in This Study.

(a) Science task (b) Negotiation task

(c) Problem solving task (d) Decision making task

Coding Framework

Two coding frameworks were considered in this study. The first coding framework (Liu

et al., 2016) was built off an extensive review of computer supported collaborative learning

(CSCL) research findings (Barron, 2003; Dillenbourg & Traum, 2006; Griffin & Care, 2014), as

well as the PISA 2015 CPS Framework (Graesser & Foltz, 2013). The coding scheme includes

four major categories designed to assess both social and cognitive processes within group

interactions. The first category, denoted as “Sharing”, captures instances of how individual group

members introduce diverse ideas into collaborative discussions. For example, participants may

share their individual responses to assessment items or highlight pertinent resources that

contribute to problem solving. The second category, “Negotiating”, aims to document

evidence of collaborative knowledge building and construction through negotiation. Examples

include agreement/disagreement, requesting clarification, elaborating, or rephrasing others’

ideas, identifying cognitive gaps, and revising ideas. The third category, “Regulating”, focuses

on the collaborative regulation aspect of team discourse, including activities like identifying

team goals, evaluating teamwork, and checking understanding. The fourth category,

“Maintaining Communication”, captures content-irrelevant social communications that

contribute to fostering or sustaining a positive communication atmosphere within the team. We

applied this coding framework to code the chat data from the science task.

The second coding framework combined the Liu et al. (2016) framework with one

developed for Negotiation tasks (Martin-Raugh et al., 2020) to classify collaborative

conversations across three task types (negotiation, problem solving, decision making)

administered in Kyllonen et al. (2023). The goal in developing this second coding framework

was to develop a set of categories that could be interpreted comparably across the three diverse

tasks, which would allow us sensibly to explore the generalizability versus task-specificity of

conversational behavior across tasks. The rubric began as a concatenation of conversational

categories and their definitions from Liu et al. (2016) and Martin-Raugh et al. (2020) but was

refined through experiences of five human annotators who coded a subset of conversations

independently, and then refined category definitions and resolved coding disagreements

iteratively until a consensus on category definitions and treatment of particular cases emerged,

and cross-category confusions were kept to a minimum.

The final framework comprised five top-level categories, each of which is associated with
two to five subcategories, for 17 subcategories in total. In this study we consider only
the five top-level categories. The categories were (a) maintaining communication (MC) (greetings,

emotional responses, technical discussions, and other), (b) staying on track (OT) (keep things

moving/monitoring time, steering team effort), (c) eliciting information (EI) (eliciting

information from another about the task; eliciting strategies, eliciting goals or opinions), (d)

sharing information (SI) (sharing information, sharing strategies, sharing goals or opinions), and

(e) acknowledging (AK) (acknowledging partner’s input, stating agreement or accepting a

tradeoff, stating disagreement or rejecting a tradeoff, building off one’s own or a teammate’s

idea, and proposing a negotiation tradeoff or suggesting a compromise).
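For reference, the five top-level categories and their subcategory labels described above can be summarized as a simple mapping; the Python structure below is an illustrative paraphrase of the prose, not the authors' rubric file.

# Top-level categories of the second coding framework, with subcategory labels
# paraphrased from the description above (illustrative only, not the rubric file).
SECOND_FRAMEWORK = {
    "MC": ("Maintaining communication",
           ["greetings", "emotional responses", "technical discussions", "other"]),
    "OT": ("Staying on track",
           ["keeping things moving / monitoring time", "steering team effort"]),
    "EI": ("Eliciting information",
           ["eliciting task information", "eliciting strategies",
            "eliciting goals or opinions"]),
    "SI": ("Sharing information",
           ["sharing information", "sharing strategies", "sharing goals or opinions"]),
    "AK": ("Acknowledging",
           ["acknowledging partner's input", "agreeing or accepting a tradeoff",
            "disagreeing or rejecting a tradeoff", "building off an idea",
            "proposing a tradeoff or compromise"]),
}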

In Tables 1 and 2, we show example chats from the science and decision-making

tasks together with their corresponding codes by human coders using the corresponding coding

framework.

Table 1

Example Chats from the Science task and the Corresponding Skill Categories Under the First

Coding Framework.

Team Member | Chat | Skill Category
Person A | hi | Maintain
Person B | hey | Maintain
Person B | Wait you're a real person? | Maintain
Person A | haha yea, i guess its just a coincidence | Maintain
Person A | Welp, i guess its gonna be a good day | Maintain
Person B | Easy 5 dollars | Maintain
Person A | so i put that they were both moving | Share
Person B | It's either C or A | Share
Person B | It's A | Share
Person A | Sounds good to me, i think molecules are always moving | Negotiate
Person B | Yeah | Maintain
Person B | Definitly B | Share
Person A | Well i think it depends on the temperature of the water | Negotiate
Person A | if the water is all one temp then they are moving at the same speed, or am i wrong | Negotiate
Person B | I'm not sure. | Regulate
Person A | wanna just go with c | Regulate

Table 2

Example Chats from the Decision-making Task and the Corresponding Skill Categories Under

the Second Coding Framework.

Team Member | Chat | Skill Category
Person A | What does everyone say? | Elicit
Person B | Avenue A | Share
Person C | B: several billboard ads arounds town, no clean up crew provided, full time staff known to be surly | Share
Person A | Yuck! | Maintain
Person A | I have ACB so far? | Elicit
Person C | c: option to include live local music bands at a low cost, equipped to host weddings and birthdays, owners like 10 mins from site | Share
Person A | For A, it says that there can be multiple events going on at the same time in close proximity to one another | Share
Person C | ohh thats a turn off | Share
Person A | For C, patrons have to provide their own beverages, but that sounds helpful (cheaper for BYOB) | Share
Person B | I think safety and security should be a priority here | Share
Person A | B has poor lighting in the parking lot and along outdoor walkways AND customers must contribute to the cost of liability insurance | Share
Person A | 2 minutes left | On Track
Person A | ACB? or ??? | Elicit
Person A | B sounds good for food | Share
Person B | ACB would be best option | Share
Person A | That's what I had | Acknowledge

ChatGPT Models

We chose the OpenAI GPT models deployed on Azure cloud for our study because they

are widely recognized as the most capable large language models. Specifically, we employed the

snapshot models, GPT-3.5-Turbo (version 0613) and GPT-4 (version 0613), configuring the

model temperature to zero and utilizing a fixed random seed for repeatability and consistency.
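As an illustration of this configuration, the sketch below shows one way to call an Azure-hosted GPT snapshot with the temperature set to zero and a fixed seed, using the openai Python SDK (v1.x). The endpoint, API key, API version, deployment name, and messages are placeholders; the paper does not include its client code.

# Sketch of the model configuration: temperature 0 and a fixed seed on an
# Azure-hosted GPT snapshot, via the openai Python SDK (v1.x). Endpoint, key,
# API version, deployment name, and messages are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4-0613",                 # name of the GPT-4 (version 0613) deployment
    temperature=0,                      # minimize sampling randomness
    seed=12345,                         # fixed seed for repeatability (any fixed integer)
    messages=[
        {"role": "system", "content": "You code chat turns into CPS categories."},
        {"role": "user", "content": "1. hi\n2. so i put that they were both moving"},
    ],
)
print(response.choices[0].message.content)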

Prompt Design

The art of crafting effective prompts for Large Language Models (LLMs) represents a

nuanced intersection of creativity and technical acumen (Brown et al., 2020). It has been shown

that relevant domain knowledge is essential to create desirable prompts (Zamfirescu-Pereira et

al., 2023), and an iterative refinement process is crucial for leveraging the sophisticated

capabilities of LLMs. As a result, prompt engineering has emerged as a new discipline, where

professionals are tasked with eliciting optimal responses from AI through carefully crafted prompts

(Radford et al., 2019). On the other hand, when approaching a specific task, one must consider

the virtually limitless possibilities for prompt design. The efficacy of AI in performing tasks

hinges on the sophistication and detail of these prompts. For instance, in an extreme scenario, AI

could simply replicate a solution generated by a human through the prompt. Therefore,

establishing a benchmark for prompt complexity is crucial to ensuring that the results come predominantly

from the AI's capabilities rather than the intricacy of human-generated prompts. While setting a

standard for prompt detail is challenging, one guideline could be that the effort spent by a human

in formulating the prompts should not surpass the effort it would take for the person to complete

the task without using AI. In our study, we went through a thorough prompt-engineering

exercise, adhering to best practice recommendations (OpenAI, n.d.), to craft prompts that

directed the AI models to code the chat messages effectively without being overly prescriptive. Once the
prompts were finalized, we asked ChatGPT to code the chat data directly. An example prompt used
in this study is included in the appendix.
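The appendix prompt ends by asking the model to return a numbered coding list. The helpers below sketch one plausible way to append numbered chat turns to such a prompt and to parse the returned list; the function names and the exact reply format are assumptions for illustration, not the actual scripts used in the study.

import re

def build_user_message(chat_turns):
    """Append numbered chat turns to the coding request, mirroring the final
    section of the appendix prompt."""
    numbered = "\n".join(f"{i}. {turn}" for i, turn in enumerate(chat_turns, start=1))
    return ("Below are the turns of chats. Please assign a category number to "
            "each of them and return the coding list:\n\n" + numbered)

def parse_coding_list(reply_text, n_turns):
    """Read lines such as '3. 4' or '3. Category 4' from the model's reply and
    return one category number per chat turn (None for any missing turn)."""
    codes = {}
    for line in reply_text.splitlines():
        match = re.match(r"\s*(\d+)\D*?(\d+)\s*$", line)
        if match:
            codes[int(match.group(1))] = int(match.group(2))
    return [codes.get(i) for i in range(1, n_turns + 1)]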

Results

The results of the coding are presented in Table 3. Overall, GPT-3.5 delivered coding whose
agreement with human coders was inferior to the agreement between human coders across the tasks.
Therefore, the following discussion focuses on GPT-4 coding.

Table 3

Performance of the ChatGPT coder on four datasets. Each dataset contains about 1,500 chat turns
from multiple collaborative sessions.

Dataset | Human-Human (Accuracy / Kappa) | Human-GPT-3.5 (Accuracy / Kappa) | Human-GPT-4 (Accuracy / Kappa)
Science | 0.838 / 0.779 | 0.503 / 0.343 | 0.700 / 0.591
Negotiation | 0.654 / 0.527 | 0.578 / 0.414 | 0.699 / 0.597
Problem Solving | 0.814 / 0.739 | 0.700 / 0.593 | 0.818 / 0.751
Decision Making | 0.793 / 0.683 | 0.648 / 0.507 | 0.828 / 0.742
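Agreement statistics of the kind reported in Table 3 can be computed as shown below, assuming Cohen's kappa (the paper does not name the kappa variant) and scikit-learn; the two label lists are toy placeholders, not the study's data.

# Agreement statistics of the kind reported in Table 3, assuming Cohen's kappa
# and scikit-learn; the label lists are toy placeholders, not the study's data.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_codes = ["Share", "Maintain", "Negotiate", "Share", "Regulate"]
gpt_codes = ["Share", "Maintain", "Share", "Share", "Regulate"]

print("Accuracy:", accuracy_score(human_codes, gpt_codes))      # proportion of exact matches
print("Kappa:", cohen_kappa_score(human_codes, gpt_codes))      # chance-corrected agreement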

For the negotiation, problem solving and decision-making tasks, which shared the same

second coding framework, we observed that GPT-4 exhibited a higher level of agreement with

human coders compared to the agreement between human coders themselves. Coding agreement

varied by task, and the trends for ChatGPT coding and human coding were similar,

indicating that where the human-human coding agreement for a task was low, the corresponding

human-ChatGPT coding agreement was low too. The coding agreement from the negotiation

tasks was the lowest among the three tasks under the second coding framework for both human

coders and the GPT-4-based coder, which indicates that the content of the chat data has a direct
impact on the coding agreement.

For the science task under the first coding framework, the agreement between GPT-4 and
human coders was much lower than that between human coders. This indicates that human
coders pick up on something that GPT-4 misses. Because the coding of the chats from the
negotiation task already showed that chat content affects coding quality, it is likely that some
peculiarities of the chat content from the science task are responsible for GPT-4's lower coding
quality. To gain deeper insight into the chat content, we created a

word cloud for the nouns in the chats from the science task. As a comparison, we also created a

word cloud based on the nouns from the chats of the decision-making task, where GPT-4 showed the

highest agreement with human coding. The results are shown in Figure 2.
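The noun word clouds can be produced along the lines of the sketch below, assuming NLTK for part-of-speech tagging and the third-party wordcloud package for rendering; the chat turns and output file name are placeholders, and the paper does not specify its tooling.

# Sketch of the noun word-cloud analysis, assuming NLTK for part-of-speech
# tagging and the third-party `wordcloud` package for rendering; chat turns and
# the output file name are placeholders. NLTK resource names vary by version.
import nltk
from wordcloud import WordCloud

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

chats = ["i think molecules are always moving",
         "if the water is all one temp then they are moving at the same speed"]

nouns = [
    word.lower()
    for turn in chats
    for word, tag in nltk.pos_tag(nltk.word_tokenize(turn))
    if tag.startswith("NN")             # keep common and proper nouns
]

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate(" ".join(nouns)).to_file("science_task_nouns.png")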

We observed a distinct contrast in the lexical composition between the chats from the

science task and those arising from the decision-making task. Chats from the science task

predominantly featured scientific terms such as molecules, air, water, mass, and energy,

indicative of specialized discussion in the science domain. Conversely, chats originating from the

decision-making task exhibited a prevalence of commonplace terms like venue, apartment, rent,

and information, reflecting a more colloquial domain. This observed discrepancy in linguistic

patterns may help explain the disparity in GPT-4's coding performance across the tasks. It is possible that

the relative scarcity of scientific terms in the training data of the GPT-4 model may yield less

refined vector representations, thereby impinging upon the overall coding accuracy, which is

consistent with the decreased performance of ChatGPT in some technical areas (OpenAI, 2023).

Confirmation of this hypothesis requires more extensive comparisons with other datasets, which

we will report in a future paper.

Figure 2

Word clouds of the nouns in the chats from science and decision-making tasks. The top panel is

from the science task and the bottom panel is from the decision-making task.

Discussion

Our empirical study shows that instructing ChatGPT to code the chats from CPS tasks

into predefined categories can achieve quality comparable to that of human coders for tasks that do not
involve much technical or scientific discussion. This finding establishes that using ChatGPT to

code chat data from CPS activities is feasible and reliable for some applications. By doing so,

one can substantially reduce both time and costs, opening a new avenue for scalable analysis of

communication data. Meanwhile, we identified several noteworthy issues during the process,

which we discuss in the following.

Firstly, each LLM has a maximum context window. The versions of GPT-3.5-Turbo
and GPT-4 we used have maximum context windows of 4,096 and 8,192 tokens, respectively. Thus, we
generally could not put all the chat turns into a single API call. Instead, we used a batch of about 70
chat turns, corresponding to one collaborative session, for each API call. Secondly, we

observed that the use of batches serves not only to meet the requirement of the context window

size, but also to ensure more reliable coding. We noticed that if the number of chat turns was

high, e.g., more than 200, (a) it would take much longer to get the results back from the API and

(b) there would be a higher chance of getting a mismatch between the number of chat turns sent

and the number of coded chat turns returned. Thirdly, we observed that even if we set the temperature to zero

and used the same random seed, if we called the API to code the same dataset in different runs,

we still could get some small changes in the coding, though the agreement statistics remained

almost unchanged across the runs. We also verified that averaging the results from multiple runs

did not improve the performance.
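The batching strategy described above can be sketched as follows; code_batch is a hypothetical wrapper around the API call and reply parsing illustrated earlier, and the retry-on-mismatch logic is one plausible way to handle the count mismatches noted above, not necessarily the exact procedure used.

# A minimal sketch of the batching strategy: code one collaborative session
# (about 70 turns) per API call and re-ask when the number of returned codes
# does not match the number of turns sent. `code_batch` is a hypothetical
# wrapper around the API call and reply parsing illustrated earlier.
def code_in_batches(sessions, code_batch, max_turns=70, max_retries=1):
    all_codes = []
    for turns in sessions:                          # one list of chat turns per session
        for start in range(0, len(turns), max_turns):
            batch = turns[start:start + max_turns]
            codes = code_batch(batch)
            retries = 0
            while len(codes) != len(batch) and retries < max_retries:
                codes = code_batch(batch)           # retry on a count mismatch
                retries += 1
            all_codes.extend(codes)
    return all_codes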

Despite the great promise of directly coding chat turns using ChatGPT, we are also aware

of some limitations of the current study. Firstly, the best coding results we obtained using GPT-4

are not necessarily the upper limit of the coding accuracy as better prompt engineering could

lead to more accurate coding. Secondly, our experiments were conducted using the snapshot

versions of GPT-3.5-Turbo and GPT-4. As these LLMs constantly evolve and more capable
models, such as Gemini and Claude 3, become available, our finding that the coding performance decreases
for the science task could change with more powerful LLMs. Finally, the coding frameworks

we experimented with are relatively simple, and it is essential for users to conduct thorough

benchmark studies when considering more sophisticated coding frameworks before relying on

ChatGPT to code their data. Nevertheless, the insights gleaned from this study offer valuable

evidence to guide researchers in making informed decisions about leveraging ChatGPT for

coding their communication data effectively.

Acknowledgement

This work was funded in part by the U.S. Army Research Institute for the Behavioral and Social

Sciences (ARI) under Grant # W911NF-19-1-0106, Education Innovation Research (EIR) under

Grant # S411C230179, and the ETS Research Allocation.

Appendix

Prompt example for coding the decision-making task.

Students form a team to work on a computer-based collaborative task. Students
use the chat function on the computer to communicate with each other. We will
code students’ chats into five different categories. The five categories are:

Category 1: Chats that help maintain communication. Such as greeting and pre-
task chit-chat, emphatic expression of an emotion or feeling, chat turns that
deal with technical issues, and chat turns that don’t fit into other coding
categories (generally off-topic or typos). Below is a list of example chats
in this category:
["Hi, I'm Jake",
"where are you from?",
"hi, can you see this?”,
"lol",
";-)",
"I’m terrible at this."]

Category 2: Chats that focus discussion to foster progress. Such as Making
Things Move or Monitoring Time to stay on track (not related to task
content), steering conversation back to the task.
Below is a list of example chats in this category:
["ah theres 8 minutes",
"let's get this done.",
"hit next”,
"we should focus on the task",
"let's talk about the next one",
"what are we supposed to do?",
"We're waiting on Sue to submit",
"I am ok lets move on"]

Category 3: Chats that ask for input on the task. Such as asking for
information related to the task, asking for task strategies (regardless of
phrasing, no strategy proposed), and asking for task-related goals or
opinions. These chats are usually questions and end with a '?'.
Below is a list of example chats in this category:
["What did you put for best?",
"Do you know what any of the letters are?",
"what about where we'll present?",
"How do I do this?",
"What did you try?",
"Should I add J+J?",
"Why did you put A for best?"]
Category 4: Chats that contribute details for working through the task. Such
as sharing information related to the task, sharing strategies for solving
the task, and sharing task-related goals or opinions.
Below is a list of example chats in this category:
["Utilities are included in A",
"C=3",
"It's alright, we still know A is 5",
"Let's try adding three numbers",
"If A is 1 then B must be 2",
"If utilities are included, it means the rent is higher",
"I had BCA"]

Category 5: Chats that Acknowledge partner(s) input and may continue off that
input with their own. Such as neutrally acknowledge a partner’s statement,
agree with or support a partner’s statement, disagree with a partner’s
statement, adding details to a previously made chat turn (their own or
another player), suggesting a solution of give and take (e.g., if you give me
this, I’ll give you that. May simultaneously reject an option on the table
and propose a compromise).
Below is a list of example chats in this category:
["okay, thanks",
"okay","OK","ok","Ok",
"thanks",
"Done",
"we did it!",
"we nailed it",
"we got all!",
"Good",
"You make a fair point",
"i'll try it",
"Oh I see. That's a good reason."]

Below are the turns of chats. Please assign a category number to each of them
and return the coding list:

1. I think that all atoms are in movement at all time even if the object is
solid
2. yeah that sounds correct
3. ok
4. …

References

Barron, B. (2003). When smart groups fail. The Journal of the Learning Sciences, 12, 307-359.

Belal, M., She, J., & Wong, S. (2023). Leveraging ChatGPT as text annotation tool for sentiment

analysis. arXiv preprint arXiv:2306.17177.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D.

(2020). Language models are few-shot learners. Advances in neural information

processing systems, 33, 1877-1901.

Dillenbourg, P., & Traum, D. (2006). Sharing solutions: Persistence and grounding in multi-

modal collaborative problem solving.

Fatouros, G., Soldatos, J., Kouroumali, K., Makridis, G., & Kyriazis, D. (2023). Transforming

sentiment analysis in the financial domain with ChatGPT. Machine Learning with

Applications, 14, 100508.

Flor, M., Yoon, S. Y., Hao, J., Liu, L., & von Davier, A. (2016, June). Automated classification of

collaborative problem solving interactions in simulated science tasks. In Proceedings of

the 11th workshop on innovative use of NLP for building educational applications (pp.

31-41).

Graesser, A., & Foltz, P. (2013). The PISA 2015 Collaborative Problem Solving Framework.

Griffin, P., & Care, E. (Eds.). (2014). Assessment and teaching of 21st century skills: Methods

and approach. Springer.

Hao, J., Chen, L., Flor, M., Liu, L., & von Davier, A. A. (2017). CPS‐Rater: Automated

sequential annotation for conversations in collaborative problem‐solving activities. ETS

Research Report Series, 2017(1), 1-9.

Hao, J., Liu, L., von Davier, A. A., Lederer, N., Zapata‐Rivera, D., Jakl, P., & Bakkenson, M.

(2017). EPCAL: ETS platform for collaborative assessment and learning. ETS Research

Report Series, 2017(1), 1-14.

Hugging Face. (n.d.). LLM prompting guide. Retrieved February 29, 2024, from

https://huggingface.co/docs/transformers/en/tasks/prompting

Kocoń, J., Cichecki, I., Kaszyca, O., Kochanek, M., Szydło, D., Baran, J., ... & Kazienko, P.

(2023). ChatGPT: Jack of all trades, master of none. Information Fusion, 101861.

Kyllonen, P., Hao, J., Weeks, J., Kerzabi, E., Wang, Y., & Lawless, R. (2023). Assessing

individual contribution to teamwork: design and findings. Presentation given at NCME

2023, Chicago, IL, USA.

Liu, L., Hao, J., von Davier, A. A., Kyllonen, P., & Zapata-Rivera, J. D. (2016). A tough nut to

crack: Measuring collaborative problem solving. In Handbook of research on technology

tools for real-world skill development (pp. 344-359). IGI Global.

Martin-Raugh, M. P., Kyllonen, P. C., Hao, J., Bacall, A., Becker, D., Kurzum, C., ... &

Barnwell, P. (2020). Negotiation as an interpersonal skill: Generalizability of negotiation

outcomes and tactics across contexts at the individual and collective levels. Computers in

Human Behavior, 104, 105966.

Moldovan, C., Rus, V., & Graesser, A. C. (2011). Automated speech act classification for online
chat. MAICS, 710, 23-29.

OpenAI. (2023). GPT-4 technical report. Retrieved from https://cdn.openai.com/papers/gpt-4.pdf

OpenAI. (n.d.). Best practices for prompt engineering with the OpenAI API. OpenAI Help

Center. Retrieved March 26, 2024, from https://help.openai.com/en/articles/6654000-

best-practices-for-prompt-engineering-with-the-openai-api

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July).

Robust speech recognition via large-scale weak supervision. In International Conference

on Machine Learning (pp. 28492-28518). PMLR.

Rosé, C. P., Wang, Y. C., Cui, Y., Arguello, J., Stegmann, K., Weinberger, A., & Fischer, F.

(2008). Analyzing collaborative learning processes automatically: Exploiting the

advances of computational linguistics in computer-supported collaborative learning.

International Journal of Computer-Supported Collaborative Learning, 3(3), 237-271.

doi:10.1007/s11412-008-9044-7

White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., ... & Schmidt, D. C. (2023). A

prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint

arXiv:2302.11382.

Zamfirescu-Pereira, J. D., Wong, R. Y., Hartmann, B., & Yang, Q. (2023, April). Why Johnny

can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings

of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1-21).
