
Social Skill Training with Large Language Models

Diyi Yang⋆  Caleb Ziems⋆  William Held⋆  Omar Shaikh⋆  Michael S. Bernstein  John Mitchell
Stanford University  Georgia Institute of Technology
diyiy@cs.stanford.edu  cziems@stanford.edu  wheld3@gatech.edu
oshaikh@stanford.edu  msb@cs.stanford.edu  mitchell@cs.stanford.edu

arXiv:2404.04204v1 [cs.CL] 5 Apr 2024

Abstract

People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Drawing upon interdisciplinary research from communication and psychology, this perspective paper identifies social skill barriers to enter specialized fields. Then we present a solution that leverages large language models for social skill training via a generic framework. Our AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback. This work ultimately calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.

1 Introduction

People need both general and domain-specific skills to succeed in home and work life (Dean, 2017). Specialized workers need not only technical competence, but also field-specific soft skills that extend broader social skill-sets. For example, mental health counselors use active listening (Nemec et al., 2017), a skill for building trust and empathy (DeVito, 2019; Ramsey and Sohi, 1997). Similarly, business managers employ conflict resolution (De Dreu et al., 2001) and strengthen team bonds (DeVito, 2019) with specialized strategies (Lipsky et al., 2003). Learning these skills may involve passive observation, trial-and-error, or explicit instruction, but ultimately, a learner will need deliberate practice (Giddens and Griffiths, 2006), as social skills are inherently interactive.

Learning environments for social skills can be inaccessible, especially when training is offered by experts in formal programs, which are expensive, time-consuming, and limited in availability. Existing mechanisms for practice and feedback largely rely on expert supervision, making training difficult to scale. In particular, there may be a shortage of professionally trained coaches (Hoffmann et al., 2023; Wiggan et al., 2021), and most coaches who can provide tailored feedback are not able to help the large number of people who need it.

Practicing with peers can be a viable alternative only if peers are experienced. Individuals may also find it challenging or unsafe to practice high-risk tasks. Many people, especially those from underrepresented groups and disadvantaged populations, have limited opportunities and social capital to learn and practice their target field's specialized skill frameworks, which can exacerbate social inequality (Ovink and Veazey, 2011). We argue that large language models can help make social skill training more accessible, safe, and inviting, with tailored feedback in realistic, virtual practice spaces.

In this position paper, we propose Social Skill Training via two complementary visions of AI assistance, and outline a roadmap for their implementation. The first vision is that of the AI Partner, which can provide a scalable solution to experiential learning through simulated practice. Already, research has shown that human role-play can effectively teach communication, cooperation, and leadership skills (Gjeraa et al., 2014; Deutsch et al., 2011). Compared to on-the-job training, simulations allow learners to assume fewer risks and opportunity costs. By making simulations accessible, the AI Partner will reduce the socioeconomic barrier to entering specialized fields. Our complementary vision is the AI Mentor, which will offer personalized feedback based on domain expertise and factual knowledge. Together, the AI Partner and AI Mentor (APAM) framework can merge experiential learning with realistic practice and tailored feedback. Our paper calls for cross-disciplinary innovation to address APAM's broad implications.

⋆ Equal contribution.
2 LLMs for Characters and Simulation

Prior research has shown simulation-based learning to be a highly effective educational tool (Blair et al., 2007; Tambe et al., 1995; Chernikova et al., 2020). These studies worked with manually constructed simulation templates. In this work, we focus on LLM-based simulations as a more flexible and scalable solution.

Prompted LLMs can effectively roleplay believable characters (Argyle et al., 2023; Park et al., 2022) who operate in specific contexts with plausible behaviors (Park et al., 2023), realistic preferences (Horton, 2023) and decision-making strategies (Zhao et al., 2024), human-like negotiation tactics (Gandhi et al., 2023), and empirically-attested psychological responses (Aher et al., 2023). Agent-based simulations, powered by LLMs, have been used for understanding debates (Du et al., 2023), strategic communication (Xu et al., 2023), collaboration (Zhang et al., 2023), conflict (Hua et al., 2023), online behavior (Ren et al., 2024), and even urban planning (Zhou et al., 2024b).

Towards AI Partner design, the set of methodologies is rapidly expanding. Prompting can further be used alongside reinforcement learning to update LLMs according to a set of high-level guiding principles called a constitution (Bai et al., 2022), which can be written in plain text for rapid prototyping (Petridis et al., 2023). Extensions of LLM-based dialogue models include architectures for character consistency (Touvron et al., 2023) and fine-tuning on datasets specific to character simulation and conversation (Thoppilan et al., 2022; Shuster et al., 2022b; Kwon et al., 2023).

The development of the AI Mentor requires special care. Unlike the AI Partner, which may simulate real-world misconceptions, the AI Mentor should stay grounded in recent expert knowledge, which may not be present in the model's original training corpus. As a possible solution, Retrieval-Augmented Generation (RAG; Lewis et al., 2020) can fetch relevant knowledge from external sources and dynamically update the prompt (Xu et al., 2021; Shuster et al., 2022a; Jiang et al., 2023; Khattab et al., 2022). These approaches largely aid the process of leveraging knowledge from textbooks and other curated sources.

The APAM framework may employ LLMs as only one part of a larger system that integrates retrieval, tool use (Schick et al., 2024), constitutional decision making, and generation into a single pipeline (Wang et al., 2024a; Wu et al., 2022; Yang et al., 2022) via planning modules (Shinn et al., 2023; Yao et al., 2022). Planning can rely on traditional search algorithms (Yao et al., 2024), or may separately prompt another LLM to evaluate, filter, and rank candidate plans (Shinn et al., 2023; Yao et al., 2022). Planning may even rely on LLM-generated code, which is then executed to filter viable candidates (Wang et al., 2023).

3 The APAM Framework

We propose a generic framework for social skill training with an AI Partner and an AI Mentor (APAM). Both are critical. When a user wants to learn a new social skill, the AI Partner can help them practice a relevant scenario with simulated conversation. The AI Mentor can provide knowledge-grounded feedback at critical junctures of the simulation.

3.1 AI Partner

Constructing and deploying an AI Partner is non-trivial for multiple reasons. First, it is difficult to maintain consistency in the stylistic, behavioral, and emotional characteristics of the simulated partner (Weston and Sukhbaatar, 2023). Second, faithful simulations require a high level of complexity and detail that aligns with the target domain. Third, simulations should follow an efficient curriculum that quickly and deeply educates the learner. Thus, the AI Partner should exhibit a consistent, plausible, and instructive personality. Note that diversity is one key component of an instructive system, which requires the simulated AI Partner to demonstrate diverse social and cultural attributes. Through these dimensions, LLMs offer an actionable path to realizing the ideal AI Partner.

3.2 AI Mentor

The development of the AI Mentor heavily relies on domain expertise, context awareness, and feedback efficacy, in addition to the consistency that AI Partners exhibit. Domain expertise means the AI Mentor should cite and elaborate on theories or frameworks from the related literature, such as psychotherapy, conflict resolution, or negotiation, to develop its feedback, rather than producing generic and broad responses. Context awareness means that the AI Mentor should ground its suggestions in both the current scenario and the learner's knowledge state, rather than citing generic or random facts. This creates technical challenges surrounding the handling of long context and heterogeneous input.
[Figure 1 schematic: the AI Mentor and the AI Partner both facilitate practice for the Learner. Mode 1 (Static Foundations): Conversational Content / Rubber Ducks; Mode 2 (Generative Capabilities): Theory-Grounded Suggestions / Peer Roleplay; Mode 3 (Generative Possibilities): Structured Feedback / Standardized Partner.]
Figure 1: Modes of the APAM framework. As AI capabilities improve, the APAM framework develops from
its basis in non-AI teaching practices towards the possibility of realistic simulated AI Partner learning scenarios
augmented with AI Mentor feedback that can be personalized based on prior practice sessions between the User and
the AI Partner. With LLMs, our prior work has shown that AI Mentors can effectively generate suggestions based on
best practices (Hsu et al., 2023) and AI Partners can replicate many of the benefits of roleplay (Shaikh et al., 2023a).

Feedback efficacy means the mentor should personalize the communicative style, timing, specificity, and granularity of its feedback to most effectively empower the learner at a given moment. Feedback should also be empathetic, respectful, and appropriate to the cultural and social context.

3.3 Methodology

We now propose a generic methodology for Social Skill Training via LLMs in four steps: (i) understanding the social processes that underlie one's desired skill (e.g., conflict resolution); (ii) designing an AI Partner to simulate conversations that expose the learner to the target processes, allowing the learner to practice; (iii) creating an AI Mentor to provide tailored feedback; (iv) integrating the two agents into a simulated environment where users can learn safely. These four steps ensure effective social skill training. It is only through simulation in Step (ii) that users can practice realistically, and only with domain knowledge in Step (iii) that the system can provide pedagogically effective feedback. Finally, we can determine the success of our system in Step (iv) when we run comparative user studies.

Beginners are the ideal audience for the APAM framework, but skilled workers could also use APAM systems to refresh their knowledge. Even if AI Partners have imperfections in simulation and AI Mentors provide relatively rigid theoretical feedback, the APAM framework can provide benefits by structurally facilitating exploration and analytical self-reflection (e.g., rubber duck debugging) (Schon and DeSanctis, 1986; Ku and Ho, 2010). APAM focuses on empowering users to become more aware of where they struggle.

3.4 Examples of APAM

There are many domains where APAM can improve learners' skills. Table 1 samples five broad skill clusters (e.g., active listening) which translate into career-specific domains (e.g., mental health counseling) through field-specific frameworks (e.g., motivational interviewing). These broad skill examples come with highly developed psychological tests and self-report inventories to assess learning objectives. Our framework is not limited only to such canonical examples; we will discuss how to evaluate APAM systems more broadly in §6.

Recent work already exemplifies the APAM framework. For instance, Sharma et al. (2023) developed Hailey, an AI Mentor that provides feedback to mental health supporters when writing empathetic messages.
Social Skill Clusters

Active Listening (Rogers and Farson, 1957)
  Description: Listening to express understanding of the speaker's intentions.
  Evaluation: Active Listening Attitude Scale (Mishima et al., 2000)
  Application Domain: Counseling (Nemec et al., 2017)
  Domain-Specific Framework: Motivational Interviewing (Moyers et al., 2014)
  Learner: Novice Therapist; AI Partner: Digitized Patient; AI Mentor: Expert Counselor

Conflict Avoidance (Morris-Rothschild and Brassard, 2006)
  Description: The ability to prevent disagreements or differences of opinion.
  Evaluation: Dutch Test for Conflict Handling (Van der Vliert, 2013)
  Application Domain: Classroom Management
  Domain-Specific Framework: Positive Behavioral Interventions and Supports (Bradshaw et al., 2012)
  Learner: Teacher-in-Training; AI Partner: Virtual Student; AI Mentor: Experienced Teacher

Conflict Resolution (Behfar et al., 2008)
  Description: The ability to resolve disagreements or differences of opinion.
  Evaluation: Dutch Test for Conflict Handling (Van der Vliert, 2013)
  Application Domain: Product Management (Lipsky et al., 2003)
  Domain-Specific Framework: Alternative Dispute Resolution (Lipsky et al., 2003)
  Learner: Manager; AI Partner: Simulated Dispute; AI Mentor: Mediator

Empathy (Smith, 2006)
  Description: The ability to understand another person's experience.
  Evaluation: Jefferson Scale (Hojat et al., 2001)
  Application Domain: Nursing (Yu and Kirk, 2009)
  Domain-Specific Framework: Person-Centered Nursing (McCormack and McCance, 2006)
  Learner: Nurse-in-Training; AI Partner: Digitized Patient; AI Mentor: Experienced Nurse

Rhetoric (Aristotle, 1984)
  Description: The ability to present strong arguments for one's beliefs.
  Evaluation: Facilitative Interpersonal Skills (Anderson et al., 2007)
  Application Domain: Litigation (Singer, 1988)
  Domain-Specific Framework: CREAC Legal Writing Paradigm (Kraft, 2014)
  Learner: Novice Litigator; AI Partner: Simulated Courtroom; AI Mentor: Expert Lawyer
Table 1: Different use cases of the APAM framework. Therapists and other specialists depend on general skill clusters like active listening, which are formalized in domain-specific frameworks like motivational interviewing. In this example (left column), the AI Partner might be a digitized patient, while the AI Mentor is an expert counselor.

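To make the four-step methodology of §3.3 concrete, the loop below sketches, in illustrative Python, how an AI Partner and an AI Mentor might be integrated into one practice environment. Everything here is hypothetical: `partner_reply`, `mentor_feedback`, and `practice_session` are stand-ins for prompted LLM calls, and the mentor's trigger is a deliberately toy heuristic, not a real counseling-strategy classifier.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Transcript:
    """One practice session: (speaker, utterance) turns plus mentor notes."""
    turns: List[Tuple[str, str]] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)

def partner_reply(scenario: str, learner_utterance: str) -> str:
    # Stand-in for an LLM-based AI Partner prompted to roleplay `scenario`
    # (step ii: simulated conversation exposing the learner to the process).
    return f"[partner, in '{scenario}'] responds to: {learner_utterance}"

def mentor_feedback(learner_utterance: str) -> Optional[str]:
    # Stand-in for an AI Mentor grounded in an expert framework (step iii).
    # Toy "critical juncture": the learner asked no open-ended question.
    if "?" not in learner_utterance:
        return "Try an open-ended question to understand the other party's view."
    return None

def practice_session(scenario: str, learner_turns: List[str]) -> Transcript:
    # Step iv: integrate both agents into one simulated practice environment.
    session = Transcript()
    for utterance in learner_turns:
        session.turns.append(("learner", utterance))
        session.turns.append(("partner", partner_reply(scenario, utterance)))
        note = mentor_feedback(utterance)
        if note:
            session.feedback.append(note)
    return session
```

In a deployed system, `mentor_feedback` would instead consult a domain-specific framework such as Motivational Interviewing to decide when and how to intervene.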
To promote better political engagement online, Argyle et al. (2023) developed an AI Mentor system that can provide feedback on polite and validating discourse strategies. In the legal domain, Jiang et al. (2024) leveraged LLMs to help non-experts learn intricate legal concepts, making legal knowledge accessible and encouraging civic participation in democracy. Beyond these examples, we now discuss three APAM applications in more detail: CARE for peer counseling, Rehearsal for conflict resolution, and GPTeach for education.

CARE (AI Mentor) Peer counseling platforms depend on effective volunteer counselors, but most volunteers do not have access to personalized learning resources. One of our recent works introduces CARE, an interactive AI Mentor that trains peer counselors with automatic suggestions (Hsu et al., 2023). During the practical training stage, CARE diagnoses which counseling strategies are most suitable in the given context and suggests tailored responses. Counselors can choose to select, modify, or ignore these suggestions. We find that this LLM-based system, trained on Motivational Interviewing strategies from counseling conversation data, significantly helps novice counselors respond to challenging situations.

Rehearsal (AI Partner) Conflict is an uncomfortable and unavoidable part of life, but people can learn conflict resolution through deliberate and strategic practice. Since few people have access to the necessary training resources, we developed the Rehearsal system (Shaikh et al., 2023a) to provide these at scale. Rehearsal helps users practice conflicts with a believable simulated interlocutor (AI Partner), identify alternative conversational paths, and learn through feedback on how to apply specific conflict strategies (AI Mentor). With Rehearsal, users can practice predefined conflict scenarios or define their own novel scenarios. Our between-subjects evaluation showed that Rehearsal significantly helps learners navigate later unaided conflicts compared to control groups.

GPTeach (AI Partner) For a teaching assistant (TA), three important domain-specific skills are academic communication, classroom management, and pedagogy. Learning these skills on the job can be stressful (Eddy and Gaston-Gayles, 2013), yet novice TAs in particular rarely have the resources they need to develop these skills before entering the classroom. TA instruction is often limited to static guides written by expert TAs; the first time new TAs ever practice is with actual students. To reduce potential harms, TAs should have a space to practice their teaching skills beforehand. To this end, GPTeach (Markel et al., 2023) uses LLMs to simulate a programming course in which simulated students make mistakes and ask questions like real students.
This allows novice TAs to practice across a wide range of student behaviors.

4 Vision for Safe Deployment

LLMs have strong potential as tools for social skill training because they can flexibly generate coherent and natural text without the extensive topic-specific engineering required by prior works (Biswas et al., 2005). However, this flexibility often comes with more limited controllability, making such deployment dangerous for high-risk scenarios like therapy or mental health crises.

Our APAM framework provides guidelines for how to safely use AI in social skill training by decomposing safe usage into a continuum. In this section, each safety recommendation is tailored to a specific level of system capabilities. The different modes below represent different capability clusters that one might foresee from AI Mentors and AI Partners. By selecting a mode appropriate to current capabilities and limitations, one can safely deploy valuable LLMs without requiring solutions to difficult open technical safety questions.

4.1 AI Partner Continuum

Mode 1: Rubber Ducking. Simulation-based learning is grounded in a wealth of cross-disciplinary research (Cherryholmes, 1966; Dorn, 1989; Randel et al., 1992; Kincaid et al., 2003; Brennan and Vos, 2013). Even simple, low-fidelity simulations can prove effective; to demonstrate this, we consider the least developed partner: a passive, inanimate object. The practice of explaining your ideas to a rubber duck is called "rubber ducking." Despite being a passive "partner," the rubber duck helps learners identify mistakes through the power of social learning and explanation (Ku and Ho, 2010). While today's LLMs are certainly more powerful than a rubber duck, this highlights how "partners" can be valuable and safe even without human-level capabilities.

Mode 2: Peer Roleplay. Current Partner technologies (e.g., Rehearsal) resemble untrained peers roleplaying unfamiliar situations. While these simulations share surface-level characteristics with real social situations, they often lack nuance, especially for roles with which peers may not have lived experience (Matz and Ebner, 2010). Despite this shortcoming, roleplay has long been a valuable tool for curriculum design, since it can help move learners from abstract theories to real-world practice.

Mode 3: Standardized Partner. In high-risk domains like therapy, AI Partners will need to maintain a higher standard of consistency and reproducibility than most LLM-based simulation systems. We call this higher standard the Standardized Partner, after the "Standardized Patients" of medical training (van der Vleuten and Swanson, 1990): professionals trained to reproducibly simulate a patient with specific personality traits and ailments. In medicine, Standardized Patients can prepare students as effectively as expert practitioners, which shows that expert-training AI may not require the development of expert AI. Achieving this requires addressing the stereotypes (Shaikh et al., 2022), caricatures and tropes (Cheng et al., 2023a,b), as well as misinformation (Lin et al., 2021) produced by today's LLMs.

4.2 AI Mentor Continuum

Mode 1: Conversational Content. Where AI Partners help learners learn through experience, AI Mentors connect formal or theoretical knowledge to these experiences. Fundamentally, Mentors can also be grounded in non-AI teaching principles: when educational materials follow a conversational rather than formal style, learning outcomes improve consistently (Sorden, 2012). The simplest AI Mentors are systems that rephrase dense or distractingly formal academic texts into the most understandable register.

Mode 2: Theory-Grounded Suggestions. Instead of presenting theories in the abstract, systems can offer theory-grounded suggestions that remind learners of expert theories more naturally. Importantly, the suggestion format does not require that the system have perfect knowledge, since learners can benefit from judging even imperfect suggestions, developing their own intuitions about when theory applies. CARE (Hsu et al., 2023) is one such work that tests the limits of these capabilities.

Mode 3: Structured Feedback. AI Mentors can be improved to provide structured, actionable, and personalized suggestions with a greater scope and time-scale than the local-level feedback of Mode 2. This would require reasoning over long, multi-turn conversations to an extent not possible with the attention mechanisms and context-length limitations of current LLMs (Liu et al., 2024). The technical requirements of such an AI Mentor may be greater than those of developing an AI Expert directly, as teaching a task can be more challenging than performing it.
We believe this challenge is merited, as AI Mentors can create lasting value even after exposure to the AI Mentor system ends. This enhances rather than diminishes human expertise in critical areas and avoids creating an ongoing dependency on AI.

5 Technical Challenges

To create effective avenues for social skill training, APAM systems require work on concrete technical challenges. In this section, we outline a core subset, prioritizing long-term interaction, expert-driven design of APAM systems, personalization, and design for end-user interaction.

5.1 Optimizing Long-Term Interactions

Persona Consistency. First, LLMs should remain consistent when simulating social skill training. When individuals practice with LLMs over multiple turns, an LLM should not "forget" aspects of the initial prompt, e.g., by providing feedback unrelated to the initial instruction or ignoring attributes of the roleplayed simulation. Like Weston and Sukhbaatar (2023), we suspect a limitation of the attention mechanism in LLMs: as the context window grows, models may place less attention on the initial instructions. One avenue of future work involves designing modeling or prompting methods to enforce consistency. Already, Ghost Attention (Touvron et al., 2023) and System 2 Attention (Weston and Sukhbaatar, 2023) offer technical solutions for maintaining instruction consistency. Beyond modeling solutions, benchmarking consistency across multi-turn skill training—either by collecting real-world interaction datasets or by constructing a static benchmark—would highlight deficiencies addressable by future work.

Conversational Grounding is a fundamental component of interpersonal interaction (Clark, 1996). We often take multiple turns to establish an utterance as common ground. Humans do not simply "follow instructions"—we ask clarification questions, follow up on underlying details, or repair assumptions made by other interlocutors (Clark and Schaefer, 1989). This is critical for social skill training: dialogue agents must build grounding with individuals before generating feedback. Current instruction-following LLMs, however, are trained on single-step interactions (Ouyang et al., 2022). Thus, contemporary LLMs generate text that assumes characteristics about an individual (Shaikh et al., 2023b; Chiu et al., 2024). In educational or mental health contexts, assuming the source of a learner's struggles can result in irrelevant feedback (Graesser et al., 1995; Strumwasser et al., 1991; Wang et al., 2024b). Integrating multi-turn preference optimization into LLMs is one promising avenue for future work; Hong et al. (2023) and Andukuri et al. (2024), for example, explore RLHF and self-improvement methods to generate conversational grounding. However, identifying where and why humans take time to establish common ground across a diverse set of situations—and training LLMs to reflect this behavior—is still an open problem.

5.2 Integrating Expert Frameworks

For training systems to be effective and safe (Demszky et al., 2023), they should be closely integrated with domain-specific expert skill frameworks like motivational interviewing (cf. Table 1). With LLM agents, however, adherence to specific frameworks is not guaranteed. By default, LLMs generate text in an unconstrained fashion. Ensuring that generation adheres to expert frameworks is a highly domain-specific process. For effective interaction, experts must first outline and demonstrate specific strategies and constraints (Agrawala et al., 2011). Learning how to integrate these constraints into an LLM—whether by fine-tuning a new model, designing new constrained decoding processes (Keskar et al., 2019), or building prompting pipelines (Wu et al., 2022)—is an important avenue for future work. Finally, APAM-based systems should allow a practitioner to reflect on theory in structured feedback or peer role-play (Schon and DeSanctis, 1986). For example, technical methods should enable users to explore counterfactual roleplays or feedback grounded in theory.

5.3 Designing for End-User Control

While we focus on technical methods that improve social skill training with LLMs, we note that these methods should be amenable to user adjustments. If an individual decides to change the underlying framework used by an LLM for training, adjustments should be directly possible in the system itself. Systems like ConstitutionMaker (Petridis et al., 2023), for example, allow end-users to design and edit prompting principles through an interactive interface. Similarly, new technical methods should come with interactive complements that enable direct control (Shneiderman, 1983).
Since training systems are inherently user-facing, designing interactive interfaces that allow for accessible control—whether by an expert or a learner—will let individuals customize the type of training they receive, instead of depending on a researcher.

5.4 Personalizing Skill Training

Personalization is a key challenge even for general tasks around LLMs (Mysore et al., 2023; Tan et al., 2024; Wu et al., 2021). This connects to the consistency and diversity attributes of the AI Partner, as well as the feedback efficacy of the AI Mentor. Effective skill training tailors feedback and experiences to meet the needs of each learner. Such personalized training has been made possible via APAM, as learners can select and design AI Partners or Mentors that are relevant to them. It is, however, not trivial to operationalize personalization (Flek, 2020; Dudy et al., 2021). Prior research has investigated various writing styles (e.g., formal vs. informal, simplified vs. sophisticated language usage) (Alhafni et al., 2024; Li et al., 2023) and learners' expertise in certain topics (Bernacki et al., 2021). Building upon these prior studies, and taking into account the appropriate set of personalization-related attributes—as well as learners' knowledge or expertise backgrounds (Huang et al., 2012)—becomes increasingly important.

6 Evaluation

The evaluation of AI Partners and AI Mentors is a major challenge; tools based on APAM involve complex computational systems and interaction with users who have varying desires and backgrounds. To develop these training tools as a field, evaluation measures need to move beyond the metrics traditionally used in Natural Language Processing to protocols from multiple relevant fields and stakeholders. Including multidisciplinary perspectives will help evaluate the empirical performance of such systems, their usability from a user perspective, and their long-term impacts on both users and their communities.

At present, research on text generation focuses largely on intrinsic evaluations, which assess the quality of outputs with predefined rubrics or interactions. In Table 2, we separate these into fully automated evaluations and user-driven evaluations. Reference-based metrics, such as perplexity or Kullback–Leibler divergence (Theis et al., 2016), are common automated assessments of system quality, as they are both simple and allow for a rich definition of desired behavior through demonstrations. While many works aim to optimize metric reliability (Hashimoto et al., 2019), these metrics often lose statistical power as researchers implicitly optimize them on fixed datasets (Goyal et al., 2023).

In pursuit of more flexible assessments, practitioners often turn to human qualitative assessments of systems, either through Likert-scale scoring or comparative ranking of systems. In these procedures, system developers build a rubric of desirable qualities such as believability, helpfulness, and consistency. These metrics are a useful gauge of the quality of interactive systems. However, as systems become more generally coherent and rubrics become more fine-grained, these methods of human validation often raise reproducibility concerns (Karpinska et al., 2021). While LLMs themselves are increasingly used to replace the human annotators in these processes (Dubois et al., 2024), this raises separate concerns about the systemic judgment biases of the LLM as a judge (Zheng et al., 2024). As such, other studies have focused on coarser, functional metrics from user studies, such as the Recommender Score (Markel et al., 2023) or the rate at which users make use of system outputs (Hsu et al., 2023).

To develop effective evaluations of more powerful systems, we believe domain users need to be involved as collaborators, rather than just as annotators. Potential users are best placed to assess the intrinsic measures that make a system usable, confusing, or even harmful. In current procedures, users are assigned to predefined roles, assessing systems along strictly defined rubrics created by the system designers, which centers the process on the developer's preconceptions (Birhane et al., 2022). Resolving this is not a simple matter of involving different individuals in the above evaluation metrics. Instead, potential users should be involved as stakeholders in high-level design—before development begins—to center the process around recognizing end-user expertise. For example, involving experts in the design of a training tool may highlight pedagogical theory overlooked by researchers. Watching end-users interact with prototype APAM systems will highlight opportunities for new technical methods or interactions. Practices from participatory design in HCI can serve as helpful guidelines for APAM platforms (Muller and Kuhn, 1993; Schuler and Namioka, 1993). Ultimately, however, extrinsic evaluation of how
Intrinsic Evaluation

Reference Based Evaluation (Automated; APAM): Metrics of the similarity and distinguishability between a system's interactions and a set of gold-standard interactions. Examples: (Hashimoto et al., 2019)
Topic Analysis (Automated; AP): Assessment of relevance of topics covered compared to expectations. Examples: (Cheng et al., 2023b)
Classifier Based Scoring (Automated; APAM): Using trained classifiers to categorize the frequency of known effective and realistic behaviors. Examples: (Sharma et al., 2024)
LLM Prompt Scoring (Automated; AP): Prompting LLMs to act as automated judges which provide Likert-scale scores for a simulation. Examples: (Zhou et al., 2023; Dubois et al., 2024)
Human Ranking (User; APAM): Comparative metric where systems are ranked based on a rubric of evaluation. Examples: (Park et al., 2023; Zhou et al., 2024a)
Human Scoring (User; APAM): Likert-scale ratings of the system along a given rubric of evaluation. Examples: (Thoppilan et al., 2022)
Suggestion Usage (User; AM): Rate at which participants utilize suggestions provided by an AI Mentor system. Examples: (Hsu et al., 2023)
Recommender Score (User; APAM): Rating of how likely a user would be to recommend the system to a friend. Examples: (Markel et al., 2023)

Extrinsic Evaluation

Behavioral Impacts (Short-Term; APAM): Changes in qualitatively coded participant behaviors before and after exposure to the system. Examples: (Shaikh et al., 2023a; Markel et al., 2023)
Self-Efficacy Reports (Short-Term; APAM): Changes in participants' self-reported efficacy on the skills practiced before and after exposure. Examples: (Shaikh et al., 2023a)
Standardized Evaluation (Short-Term; APAM): Changes in participant scores on closed-ended assessments of knowledge about the skills practiced. Examples: (Shaikh et al., 2023a)
Short-Term Economic Outcomes (Short-Term; APAM): Impacts of a training program on participants' short-term wages and employment. Examples: (Adhvaryu et al., 2018; Chioda et al., 2021)
Non-Financial Benefits (Long-Term; APAM): Impacts on non-financial measures such as health, risk-taking behaviors, and levels of societal trust. Examples: (Oreopoulos and Salvanes, 2011; Heckman and Kautz, 2012)
Long-Term Economic Outcomes (Long-Term; APAM): Impacts of a training program on long-term earnings, workplace stability, and economic mobility. Examples: (Barrera-Osorio et al., 2023)
Table 2: Intrinsic and Extrinsic Evaluation Procedures applicable to APAM systems from prior work. At
present, Natural Language Processing practitioners primarily focus on intrinsic evaluations for their systems. Here,
we stress the importance of evaluating APAM systems using established measures for educational outcomes.
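To make the reference-based row of Table 2 concrete, one simple automated check measures how far the token distribution of a system's interactions drifts from a set of gold-standard interactions. The sketch below is our own illustration, not the procedure of any cited work; the whitespace tokenization, add-alpha smoothing, and function name are assumptions.

```python
import math
from collections import Counter

def unigram_kl(reference_texts, system_texts, alpha=1.0):
    """Smoothed KL(P_ref || P_sys) over unigram distributions.

    Add-alpha smoothing keeps the divergence finite when one corpus
    contains tokens the other never produces. Lower is better: 0 means
    the two corpora have identical unigram statistics.
    """
    ref_counts = Counter(tok for text in reference_texts for tok in text.split())
    sys_counts = Counter(tok for text in system_texts for tok in text.split())
    vocab = set(ref_counts) | set(sys_counts)
    ref_total = sum(ref_counts.values()) + alpha * len(vocab)
    sys_total = sum(sys_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for tok in vocab:
        p = (ref_counts[tok] + alpha) / ref_total
        q = (sys_counts[tok] + alpha) / sys_total
        divergence += p * math.log(p / q)
    return divergence
```

A fuller treatment would compare model likelihoods rather than raw counts, but even this toy version captures the shape of an automated, reference-based comparison.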
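The "LLM Prompt Scoring" row of Table 2 can be sketched in the same spirit. The harness below is hypothetical: the prompt wording is ours, and `llm` stands in for whatever model-calling function a system actually uses, so no particular API from the cited works is implied.

```python
import re

def likert_judge(transcript, rubric, llm):
    """Ask a judge model for a 1-5 Likert rating of a simulated interaction.

    `llm` is any callable mapping a prompt string to a response string,
    a deliberate abstraction over real model APIs.
    """
    prompt = (
        "You are evaluating a social skill training simulation.\n"
        f"Rubric: {rubric}\n"
        f"Transcript:\n{transcript}\n"
        "Reply with a single integer from 1 (poor) to 5 (excellent)."
    )
    response = llm(prompt)
    match = re.search(r"[1-5]", response)
    if match is None:
        raise ValueError("judge response contained no score between 1 and 5")
    return int(match.group())
```

Because `llm` is injected, the same harness can be exercised with a stub during testing and with a real model in deployment.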
interaction with a tool changes the behavior of the individuals who use it. In the case studies we cover, measurable changes in behavior, self-efficacy reports, and standardized test scores have been used to evaluate short-term impacts. Whether these shifts are sustained and whether they create beneficial outcomes for participants, however, requires moving beyond what is currently evaluated in NLP. Educators and economists, by contrast, have deep experience designing studies to evaluate the long-term impacts of training tools and education. This is most commonly done for economic outcomes, since these are the most straightforwardly measured benefits of soft-skills training (Adhvaryu et al., 2018; Chioda et al., 2021; Barrera-Osorio et al., 2023). The benefits of soft-skills training have additionally been measured through social outcomes that correlate with general quality of life (Oreopoulos and Salvanes, 2011; Heckman and Kautz, 2012).

We believe NLP researchers will develop more impactful systems by taking both intrinsic and extrinsic evaluation perspectives. As specific methods begin to show promise through intrinsic automated metrics, they can begin to be utilized as part of human-centered systems design processes. Once a particular system is deemed satisfactory and capable by initial users, it can begin real-world trials to assess impacts from long-term use. This form of development, where deployment is staged alongside increasingly broad evaluation criteria, is key to the safe rollout of research with both high impact and potential risk (Mohs and Greig, 2017). For APAM and other NLP systems where value is derived through direct user interaction, we benefit by learning best practices specific to each domain.

Finally, given the high stakes but relatively low cost of access to these tools, providing algorithmic auditors transparent access to the systems throughout this process should be standard practice in the development of APAM systems (Raji et al., 2020). Risk factors and adverse events (e.g., simulation failures, hallucination, over-reliance) stemming from any of these evaluation procedures should
be released in detail, rather than reported in aggregate, in order to facilitate external analysis of possible trends, such as via the use of best practices in medical trials (FDA, 2009).

7 Discussion

7.1 Societal Impact

The APAM framework can help in a variety of societal applications. Broadly, APAM can be designed to help increase soft skills like self-awareness, social awareness, self-regulation, and relationship building, all of which can lead to personal well-being and broader social good outcomes (Jagers et al., 2018). Take the soft skill of self-awareness as an example: self-awareness is an understanding of personal and cultural identity, which correlates with physical wellbeing (Taylor and Usborne, 2010; Schwartz et al., 2008), mental health (Bhugra and Becker, 2005), and beneficial learning outcomes (Altugan, 2015). Psychologically, self-awareness is a foundation for optimism, confidence, and a sense of agency. Relatedly, training self-regulation skills like stress management and intrinsic motivation via APAM can lead to healthy lifestyle choices (Antoni et al., 2006). Relationship building skills like conflict resolution and communication enhance group learning and cooperative work, but most importantly for the individual, these strong relationships are expected to provide critical social support (Seeman et al., 2001) and a higher quality of life (Cohen, 2004). Additionally, skills of empathy and perspective-taking form a broader social awareness and are the foundations of strong citizenry (Wray-Lake and Syvertsen, 2011). Collectively, our APAM framework is expected to provide both individual and societal benefits in a more equitable manner, as it is designed to provide social learning opportunities to everyone.

APAM is a curriculum design tool that could enable learners to co-construct their learning paths, empowering people to actively discover, define, and fill new niches in the economy. Some concrete applications of APAM are achievable in the short term. However, the potential impact of this broader vision necessitates further exploration. This requires new experts to design, train, and maintain AI tooling (Wilson et al., 2017) for APAM, as well as curriculum design experts to assess when and where true practice and mentorship is irreplaceable.

Even in areas where current systems for social learning of domain-critical soft skills exist, they are often inaccessible, especially to the socioeconomically disadvantaged. Beyond improving the quality of existing outcomes, APAM could make the same high-quality learning environments available to everyone. AI alone is unlikely to solve all educational and social inequity. However, by focusing on skills that are often learned informally within social groups, APAM can reduce structural inequities which often compound for the already disadvantaged across generations (Oded, 2011).

7.2 Concerns and Mitigation

Despite APAM's benefits, there are still a set of issues we need to be aware of when building systems that can train social skills:

Stereotypes. LLM simulations can output caricatures (Cheng et al., 2023b) when prompted with broad characteristics such as race and gender. Often, stereotypes arise from under-description in the prompt (e.g., stereotypically casting a "boss" as a white male). We recommend that system designers highlight and encourage users to specify attributes that are important to the simulation (e.g., gender), enabling users to make changes as needed in the interface. Existing metrics of caricature can be used to raise warnings to users when simulations drift towards stereotypes, while giving users full control over the personas with which they practice.

Distributional Shifts. APAM is designed primarily as a safe training environment in which users can build, practice, and transfer their social skills to the real world. We recommend that system designers identify and clearly communicate the limitations of the simulation and its deviations from reality. We also recommend that any system based on APAM take a human-centered development process, observing and interviewing real users to understand the system's behaviors and gather feedback for iteration. Such evaluation will help track the nature and extent of any discrepancies between the simulation and the real world. Finally, to guard against users who might rely on the system too heavily, we recommend that the system caution users against overuse.

Job Risks. APAM is not designed as a direct replacement for paid human labor. Instead, APAM is a curriculum design tool that allows learners to co-construct their learning paths, empowering people to actively discover, define, and fill niches in the economy. At the corporate level, social skill
training programs will still need professional supervision, and these professionals can use automated tools for training events, just as they might have used a static curriculum or textbook in the past. Some individuals may opt for a cheap or free standalone option if they are on the margin. As such, we weigh this risk against the potential benefit of a free-to-use tool, which can assist a broader user population, especially those without professional training or social capital. Professional experts will be able to focus on more tailored, challenging scenarios for skill training, while still maintaining their high level of expertise and uniqueness.

8 Summary and Outlook

This perspective paper examines a widespread challenge: mastering essential social skills for both personal and professional success. Opportunities to practice these skills in a safe learning environment are limited, especially for underprivileged groups. We show how LLMs can help create environments where everyone can practice social skills via our proposed AI Partner and AI Mentor framework. Here, the AI Partner offers a risk-free practice environment, while the AI Mentor provides knowledgeable, tailored advice.

Below we highlight a few take-aways that illustrate how this approach can reshape social skills learning moving forward. Firstly, utilizing LLMs for APAM requires addressing multiple technical challenges, such as enhancing the simulation of AI partners to exhibit a consistent, plausible, and instructive personality, and building AI mentors to have context awareness, domain expertise, and feedback efficiency. Secondly, deploying LLM-based social skill training systems has the potential to amplify limitations such as hallucinations and biases; thus our APAM framework offers a roadmap for how to use LLMs for social skill training by breaking safe usage into a continuum dependent on current capabilities. That is, the safe deployment of APAM should emphasize a gradual, risk-aware approach, as controllability and consistency improve for LLMs. Additionally, training social skills via LLMs might suffer from stereotypes and biases in LLM-based simulation, distribution shifts, and user reliance, as well as potential risks around job replacement. We recommend system designers take a human-centered development process, together with formative evaluations that iterate on the system's behaviors using feedback from observation sessions with real users.

Overall, the success of APAM depends on fostering interdisciplinary collaborations and team science across diverse research fields and across both academic and professional communities. Such a balanced, intentional, and collaborative process is essential for using LLMs for social good, particularly in areas such as social skill training.

References

Achyuta Adhvaryu, Namrata Kala, and Anant Nyshadham. 2018. The skills to pay the bills: Returns to on-the-job soft skills training. Technical report, National Bureau of Economic Research.

Maneesh Agrawala, Wilmot Li, and Floraine Berthouzoz. 2011. Design principles for visual communication. Commun. ACM, 54(4):60–69.

Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. 2023. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. PMLR.

Bashar Alhafni, Vivek Kulkarni, Dhruv Kumar, and Vipul Raheja. 2024. Personalized text generation with fine-grained linguistic control. ArXiv preprint, abs/2402.04914.

Arzu Sosyal Altugan. 2015. The relationship between cultural identity and learning. Procedia-Social and Behavioral Sciences, 186:1159–1162.

T Anderson, CL Patterson, and AC Weis. 2007. Facilitative interpersonal skills performance analysis rating method. Unpublished coding manual, Department of Psychology, Ohio University, Athens, OH.

Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah D Goodman. 2024. Star-gate: Teaching language models to ask clarifying questions. ArXiv preprint, abs/2403.19154.

Michael H Antoni, Suzanne C Lechner, Aisha Kazi, Sarah R Wimberly, Tammy Sifre, Kenya R Urcuyo, Kristin Phillips, Stefan Glück, and Charles S Carver. 2006. How stress management improves quality of life after treatment for breast cancer. Journal of consulting and clinical psychology, 74(6):1143.

Lisa P Argyle, Christopher A Bail, Ethan C Busby, Joshua R Gubler, Thomas Howe, Christopher Rytting, Taylor Sorensen, and David Wingate. 2023. Leveraging ai for democratic discourse: Chat interventions can improve online political conversations at scale. Proceedings of the National Academy of Sciences, 120(41):e2311627120.

Aristotle. 1984. Rhetoric. Modern Library, New York. Translated from the Greek.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. ArXiv preprint, abs/2212.08073.

Felipe Barrera-Osorio, Adriana Kugler, and Mikko Silliman. 2023. Hard and soft skills in vocational training: Experimental evidence from colombia. The World Bank Economic Review, 37(3):409–436.

Kristin J Behfar, Randall S Peterson, Elizabeth A Mannix, and William MK Trochim. 2008. The critical role of conflict resolution in teams: A close look at the links between conflict type, conflict management strategies, and team outcomes. Journal of applied psychology, 93(1):170.

Matthew L Bernacki, Meghan J Greene, and Nikki G Lobczowski. 2021. A systematic review of research on personalized learning: Personalized by whom, to what, how, and for what purpose(s)? Educational Psychology Review, 33(4):1675–1715.

Dinesh Bhugra and Matthew A Becker. 2005. Migration, cultural bereavement and cultural identity. World psychiatry, 4(1):18.

Abeba Birhane, William Isaac, Vinodkumar Prabhakaran, Mark Diaz, Madeleine Clare Elish, Iason Gabriel, and Shakir Mohamed. 2022. Power to the people? opportunities and challenges for participatory ai. In Proceedings of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–8.

Gautam Biswas, Krittaya Leelawong, Daniel Schwartz, Nancy Vye, and The Teachable Agents Group at Vanderbilt. 2005. Learning by teaching: A new agent paradigm for educational software. Applied Artificial Intelligence, 19(3-4):363–392.

Kristen Blair, Daniel L Schwartz, Gautam Biswas, and Krittaya Leelawong. 2007. Pedagogical agents for learning by teaching: Teachable agents. Educational Technology, pages 56–61.

Catherine P Bradshaw, Tracy E Waasdorp, and Philip J Leaf. 2012. Effects of school-wide positive behavioral interventions and supports on child behavior problems. Pediatrics, 130(5):e1136–e1145.

Ross Brennan and Lynn Vos. 2013. Effects of participation in a simulation game on marketing students' numeracy and financial skills. Journal of Marketing Education, 35(3):259–270.

Myra Cheng, Esin Durmus, and Dan Jurafsky. 2023a. Marked personas: Using natural language prompts to measure stereotypes in language models. ArXiv preprint, abs/2305.18189.

Myra Cheng, Tiziano Piccardi, and Diyi Yang. 2023b. Compost: Characterizing and evaluating caricature in llm simulations. ArXiv preprint, abs/2310.11501.

Olga Chernikova, Nicole Heitzmann, Matthias Stadler, Doris Holzberger, Tina Seidel, and Frank Fischer. 2020. Simulation-based learning in higher education: A meta-analysis. Review of Educational Research, 90(4):499–541.

Cleo H Cherryholmes. 1966. Some current research on effectiveness of educational simulations: Implications for alternative strategies. American Behavioral Scientist, 10(2):4–7.

Laura Chioda, David Contreras-Loya, Paul Gertler, and Dana Carney. 2021. Making entrepreneurs: Returns to training youth in hard versus soft business skills. Technical report, National Bureau of Economic Research.

Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, and Tim Althoff. 2024. A computational framework for behavioral assessment of llm therapists. ArXiv preprint, abs/2401.00820.

Herbert H Clark. 1996. Using language. Cambridge university press.

Herbert H Clark and Edward F Schaefer. 1989. Contributing to discourse. Cognitive science, 13(2):259–294.

Sheldon Cohen. 2004. Social relationships and health. American psychologist, 59(8):676.

Carsten KW De Dreu, Arne Evers, Bianca Beersma, Esther S Kluwer, and Aukje Nauta. 2001. A theory-based measure of conflict management strategies in the workplace. Journal of Organizational Behavior: The International Journal of Industrial, Occupational and Organizational Psychology and Behavior, 22(6):645–668.

Susan A Dean. 2017. Soft skills needed for the 21st century workforce. Walden University.

Dorottya Demszky, Diyi Yang, David S Yeager, Christopher J Bryan, Margarett Clapper, Susannah Chandhok, Johannes C Eichstaedt, Cameron Hecht, Jeremy Jamieson, Meghann Johnson, et al. 2023. Using large language models in psychology. Nature Reviews Psychology, 2(11):688–701.

Morton Deutsch, Peter T Coleman, and Eric C Marcus. 2011. The handbook of conflict resolution: Theory and practice. John Wiley & Sons.

Joseph A DeVito. 2019. The interpersonal communication book. Instructor, 1(18):521–532.

Dean S Dorn. 1989. Simulation games: One more tool on the pedagogical shelf. Teaching Sociology, pages 1–18.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. ArXiv preprint, abs/2305.14325.
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2024. Alpacafarm: A simulation framework for methods that learn from human feedback. Preprint, arXiv:2305.14387.

Shiran Dudy, Steven Bedrick, and Bonnie Webber. 2021. Refocusing on relevance: Personalization in nlg. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2021, page 5190. NIH Public Access.

Pamela L Eddy and Joy L Gaston-Gayles. 2013. New faculty on the block: Issues of stress and support. In Faculty stress, pages 89–106. Routledge.

FDA. 2009. Adverse event reporting to irbs improving human subject protection. Guidance for Clinical Investigators, Sponsors, and IRBs.

Lucie Flek. 2020. Returning the N to NLP: Towards contextually personalized classification models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7828–7838, Online. Association for Computational Linguistics.

Kanishk Gandhi, Dorsa Sadigh, and Noah D Goodman. 2023. Strategic reasoning with language models. ArXiv preprint, abs/2305.19165.

Anthony Giddens and Simon Griffiths. 2006. Sociology. Polity.

Kirsten Gjeraa, Thea Palsgaard Møller, and D Østergaard. 2014. Efficacy of simulation-based trauma team training of non-technical skills. A systematic review. Acta Anaesthesiologica Scandinavica, 58(7):775–787.

Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2023. News summarization and evaluation in the era of gpt-3. Preprint, arXiv:2209.12356.

Arthur C Graesser, Natalie K Person, and Joseph P Magliano. 1995. Collaborative dialogue patterns in naturalistic one-to-one tutoring. Applied cognitive psychology, 9(6):495–522.

Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701, Minneapolis, Minnesota. Association for Computational Linguistics.

James J Heckman and Tim Kautz. 2012. Hard evidence on soft skills. Labour economics, 19(4):451–464.

Jennifer A Hoffmann, Megan M Attridge, Michael S Carroll, Norma-Jean E Simon, Andrew F Beck, and Elizabeth R Alpern. 2023. Association of youth suicides and county-level mental health professional shortage areas in the us. JAMA pediatrics, 177(1):71–80.

Mohammadreza Hojat, Salvatore Mangione, Thomas J Nasca, Mitchell JM Cohen, Joseph S Gonnella, James B Erdmann, Jon Veloski, and Mike Magee. 2001. The jefferson scale of physician empathy: development and preliminary psychometric data. Educational and psychological measurement, 61(2):349–365.

Joey Hong, Sergey Levine, and Anca Dragan. 2023. Zero-shot goal-directed dialogue via rl on imagined conversations. ArXiv preprint, abs/2311.05584.

John J Horton. 2023. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research.

Shang-Ling Hsu, Raj Sanjay Shah, Prathik Senthil, Zahra Ashktorab, Casey Dugan, Werner Geyer, and Diyi Yang. 2023. Helping the helper: Supporting peer counselors via ai-empowered practice and feedback. ArXiv preprint, abs/2305.08982.

Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. 2023. War and peace (waragent): Large language model-based multi-agent simulation of world wars. ArXiv preprint, abs/2311.17227.

Yueh-Min Huang, Tsung-Ho Liang, Yen-Ning Su, and Nian-Shing Chen. 2012. Empowering personalized learning with an interactive e-book learning system for elementary school students. Educational technology research and development, 60:703–722.

Robert J Jagers, Deborah Rivas-Drake, and Teresa Borowski. 2018. Equity & social and emotional learning: A cultural analysis. CASEL Assessment Work Group Brief series.

Hang Jiang, Xiajie Zhang, Robert Mahari, Daniel Kessler, Eric Ma, Tal August, Irene Li, Alex 'Sandy' Pentland, Yoon Kim, Jad Kabbara, et al. 2024. Leveraging large language models for learning complex legal concepts through storytelling. ArXiv preprint, abs/2402.17019.

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. ArXiv preprint, abs/2305.06983.

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation. ArXiv preprint, abs/1909.05858.

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. ArXiv preprint, abs/2212.14024.

J Peter Kincaid, Roger Hamilton, Ronald W Tarr, and Harshal Sangani. 2003. Simulation in education and training. Applied system simulation: methodologies and applications, pages 437–456.

Diane B Kraft. 2014. Creac in the real world. Clev. St. L. Rev., 63:567.

Kelly YL Ku and Irene T Ho. 2010. Metacognitive strategies that enhance critical thinking. Metacognition and learning, 5:251–267.

Deuksin Kwon, Sunwoo Lee, Ki Hyun Kim, Seojin Lee, Taeyoon Kim, and Eric Davis. 2023. What, when, and how to ground: Designing user persona-aware conversational agents for engaging dialogue. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 707–719.

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, and Michael Bendersky. 2023. Automatic prompt rewriting for personalized text generation. ArXiv preprint, abs/2310.00152.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods. ArXiv preprint, abs/2109.07958.

David B Lipsky, Ronald Leroy Seeber, and Richard D Fincher. 2003. Emerging systems for managing workplace conflict: Lessons from American corporations for managers and dispute resolution professionals, volume 18. Jossey-Bass San Francisco.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.

Julia M Markel, Steven G Opferman, James A Landay, and Chris Piech. 2023. Gpteach: Interactive ta training with gpt-based students. In Proceedings of the tenth acm conference on learning@ scale, pages 226–236.

David Matz and Noam Ebner. 2010. Using role-play in online negotiation teaching. In Venturing beyond the classroom, pages 293–317. DRI Press.

Brendan McCormack and Tanya V McCance. 2006. Development of a framework for person-centred nursing. Journal of advanced nursing, 56(5):472–479.

Norio Mishima, Shinya Kubota, and Shoji Nagata. 2000. The development of a questionnaire to assess the attitude of active listening. Journal of Occupational Health, 42(3):111–118.

Richard C Mohs and Nigel H Greig. 2017. Drug discovery and development: Role of basic biological research. Alzheimer's & Dementia: Translational Research & Clinical Interventions, 3(4):651–657.

Britta K Morris-Rothschild and Marla R Brassard. 2006. Teachers' conflict management styles: The role of attachment styles and classroom management efficacy. Journal of school psychology, 44(2):105–121.

TB Moyers, JK Manuel, D Ernst, T Moyers, J Manuel, D Ernst, and C Fortini. 2014. Motivational interviewing treatment integrity coding manual 4.1 (miti 4.1). Unpublished manual.

Michael J. Muller and Sarah Kuhn. 1993. Participatory design. Commun. ACM, 36(6):24–28.

Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Steve Menezes, Tina Baghaee, Emmanuel Barajas Gonzalez, Jennifer Neville, and Tara Safavi. 2023. Pearl: Personalizing large language model writing assistants with generation-calibrated retrievers. ArXiv preprint, abs/2311.09180.

Patricia B Nemec, Amy Cottone Spagnolo, and Anne Sullivan Soydan. 2017. Can you hear me now? teaching listening skills. Psychiatric rehabilitation journal, 40(4):415.

Galor Oded. 2011. Inequality, human capital formation, and the process of development. In Handbook of the Economics of Education, volume 4, pages 441–493. Elsevier.

Philip Oreopoulos and Kjell G Salvanes. 2011. Priceless: The nonpecuniary benefits of schooling. Journal of Economic perspectives, 25(1):159–184.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Sarah M Ovink and Brian D Veazey. 2011. More than "getting us through:" a case study in cultural capital enrichment of underrepresented minority undergraduates. Research in higher education, 52:370–394.
Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.

Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2022. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–18.

Savvas Petridis, Ben Wedin, James Wexler, Aaron Donsbach, Mahima Pushkarna, Nitesh Goyal, Carrie J Cai, and Michael Terry. 2023. Constitutionmaker: Interactively critiquing large language models by converting feedback into principles. ArXiv preprint, abs/2310.15428.

Inioluwa Deborah Raji, Andrew Smart, Rebecca N White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the ai accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 33–44.

Rosemary P Ramsey and Ravipreet S Sohi. 1997. Listening to your customers: The impact of perceived salesperson listening behavior on relationship outcomes. Journal of the Academy of marketing Science, 25:127–137.

Josephine M Randel, Barbara A Morris, C Douglas Wetzel, and Betty V Whitehill. 1992. The effectiveness of games for educational purposes: A review of recent research. Simulation & gaming, 23(3):261–276.

Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. 2024. Bases: Large-scale web search user simulation with large language model based agents. ArXiv preprint, abs/2402.17505.

Carl Ransom Rogers and Richard Evans Farson. 1957. Active listening. Industrial Relations Center, the University of Chicago.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36.

Donald A Schon and Vincent DeSanctis. 1986. The reflective practitioner: How professionals think in action.

Douglas Schuler and Aki Namioka. 1993. Participatory design: Principles and practices. CRC Press.

Seth J Schwartz, Byron L Zamboanga, and Robert S Weisskirch. 2008. Broadening the study of the self: Integrating the study of personal identity and cultural identity. Social and personality psychology compass, 2(2):635–651.

Teresa E Seeman, Tina M Lusignolo, Marilyn Albert, and Lisa Berkman. 2001. Social relationships, social support, and patterns of cognitive aging in healthy, high-functioning older adults: Macarthur studies of successful aging. Health psychology, 20(4):243.

Omar Shaikh, Valentino Chai, Michele J Gelfand, Diyi Yang, and Michael S Bernstein. 2023a. Rehearsal: Simulating conflict to teach conflict resolution. ArXiv preprint, abs/2309.12309.

Omar Shaikh, Kristina Gligorić, Ashna Khetan, Matthias Gerstgrasser, Diyi Yang, and Dan Jurafsky. 2023b. Grounding or guesswork? large language models are presumptive grounders. ArXiv preprint, abs/2311.09144.

Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2022. On second thought, let's not think step by step! bias and toxicity in zero-shot reasoning. ArXiv preprint, abs/2212.08061.

Ashish Sharma, Inna W Lin, Adam S Miner, David C Atkins, and Tim Althoff. 2023. Human–ai collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence, 5(1):46–57.

Ashish Sharma, Sudha Rao, Chris Brockett, Akanksha Malhotra, Nebojsa Jojic, and William B Dolan. 2024. Investigating agency of llms in human-ai collaboration tasks. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1968–1987.

Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. ArXiv preprint, abs/2303.11366.

Ben Shneiderman. 1983. Direct manipulation: A step beyond programming languages. Computer, 16(08):57–69.

Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason Weston. 2022a. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. ArXiv preprint, abs/2203.13224.

Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, et al. 2022b. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. ArXiv preprint, abs/2208.03188.

Joseph William Singer. 1988. Persuasion. Mich. L. Rev., 87:2442.
Adam Smith. 2006. Cognitive empathy and emotional empathy in human behavior and evolution. The Psychological Record, 56(1):3–21.

Stephen D Sorden. 2012. The cognitive theory of multimedia learning. Handbook of educational theories, 1(2012):1–22.

Ira Strumwasser, Nitin V Paranjpe, Marianne Udow, David Share, Mary Wisgerhof, David L Ronis, Charlotte Bartzack, and Ali N Saad. 1991. Appropriateness of psychiatric and substance abuse hospitalization: implications for payment and utilization management. Medical Care, pages AS77–AS90.

Milind Tambe, W Lewis Johnson, Randolph M Jones, Frank Koss, John E Laird, Paul S Rosenbloom, and Karl Schwamb. 1995. Intelligent agents for interactive simulation environments. AI magazine, 16(1):15–15.

Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. 2024. Democratizing large language models via personalized parameter-efficient fine-tuning. ArXiv preprint, abs/2402.04401.

Donald M Taylor and Esther Usborne. 2010. When i know who “we” are, i can be “me”: The primary role of cultural identity clarity for psychological well-being. Transcultural psychiatry, 47(1):93–111.

Lucas Theis, Aäron van den Oord, and Matthias Bethge. 2016. A note on the evaluation of generative models. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. ArXiv preprint, abs/2201.08239.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288.

Cees PM van der Vleuten and David B Swanson. 1990. Assessment of clinical skills with standardized patients: state of the art. Teaching and Learning in Medicine: An International Journal, 2(2):58–76.

Evert Van der Vliert. 2013. Complex interpersonal conflict behaviour: Theoretical frontiers. Psychology Press.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. ArXiv preprint, abs/2305.16291.

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024a. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):1–26.

Rose Wang, Pawan Wirawarn, Omar Khattab, Noah Goodman, and Dorottya Demszky. 2024b. Backtracing: Retrieving the cause of the query. In Findings of the Association for Computational Linguistics: EACL 2024, pages 722–735, St. Julian’s, Malta. Association for Computational Linguistics.

Jason Weston and Sainbayar Sukhbaatar. 2023. System 2 attention (is something you might need too). ArXiv preprint, abs/2311.11829.

Greg Wiggan, Delphia Smith, and Marcia J Watson-Vandiver. 2021. The national teacher shortage, urban education and the cognitive sociology of labor. The Urban Review, 53:43–75.

H Wilson, Paul Daugherty, and Nicola Bianzino. 2017. The jobs that artificial intelligence will create. MIT Sloan Management Review Summer.

Laura Wray-Lake and Amy K Syvertsen. 2011. The developmental roots of social responsibility in childhood and adolescence. New directions for child and adolescent development, 2011(134):11–25.

Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022. Promptchainer: Chaining large language model prompts through visual programming. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–10.

Yuwei Wu, Xuezhe Ma, and Diyi Yang. 2021. Personalized response generation via generative split memory network. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1956–1970, Online. Association for Computational Linguistics.

Jing Xu, Arthur Szlam, and Jason Weston. 2021. Beyond goldfish memory: Long-term open-domain conversation. ArXiv preprint, abs/2107.07567.

Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. 2023. Exploring large language models for communication games: An empirical study on werewolf. ArXiv preprint, abs/2309.04658.

Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein. 2022. Re3: Generating longer stories with recursive reprompting and revision. ArXiv preprint, abs/2210.06774.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. ArXiv preprint, abs/2210.03629.

Juping Yu and Maggie Kirk. 2009. Evaluation of empathy measurement tools in nursing: systematic review. Journal of advanced nursing, 65(9):1790–1806.

Jintian Zhang, Xin Xu, and Shumin Deng. 2023. Exploring collaboration mechanisms for llm agents: A social psychology view. ArXiv preprint, abs/2310.02124.

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.

Xuhui Zhou, Zhe Su, Tiwalayo Eisape, Hyunwoo Kim, and Maarten Sap. 2024a. Is this the real life? is this just fantasy? the misleading success of simulating social interactions with llms. ArXiv preprint, abs/2403.05020.

Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. 2023. Sotopia: Interactive evaluation for social intelligence in language agents. ArXiv preprint, abs/2310.11667.

Zhilun Zhou, Yuming Lin, Depeng Jin, and Yong Li. 2024b. Large language model for participatory urban planning. ArXiv preprint, abs/2402.17161.
