Hostetter Et Al Preprint ChatGPT
Autumn B. Hostetter, Natalie Call, Grace Frazier, Tristan James, Cassandra Linnertz,
Abstract
Artificial Intelligence (AI) can write poetry and news articles that go undetected by human
readers. Can students use AI to write college assignments that go undetected by their
professors? Past and current perceptions of AI in education have differed; while some people
view AI as a tool, others view it as a threat to learning. We surveyed 83 students and 82 faculty,
asking them to read and evaluate writing samples, one of which was generated by the AI
chatbot ChatGPT-3. We found that neither faculty nor students could
detect AI-generated writing at above chance levels. Faculty and students had similar opinions
on the ethicality of various uses of AI technology and how much these uses are likely to
compromise learning. Faculty reported a high level of concern regarding the potential effects
that AI could have on their pedagogical practices. Prior experience with ChatGPT-3 and
analyzing the structure and organization of the response was found to improve detection ability
in faculty, suggesting that increased exposure and domain-specific analysis may be beneficial in
detecting AI-generated writing.

Writing assignments are a common means of assessing what students know about a topic, and
writing-to-learn is often promoted as a pedagogical practice. There is
evidence that writing can both improve conceptual understanding (Gingerich et al., 2014) and
strengthen memory for information (Spirgel & Delaney, 2016). Further, developing strong
writing skills is a learning objective in and of itself at many colleges and universities, and writing
skills are often cited by employers as a desirable quality in new hires (National Association of
Colleges and Employers, 2022). The best way to learn to write is by having frequent
opportunities to write and then receiving timely, focused feedback on that writing (Kellogg &
Raulerson, 2007). The advent of artificial intelligence (AI) that can produce writing for a student
has introduced concern that such technology could be abused by students to undermine the
usefulness of writing assignments as a pedagogical tool (e.g., Huang, 2023). The purpose of the
present study is to assess the abilities of both students and faculty at detecting the use of AI in
student writing as well as to gauge current perceptions by students and faculty about the use of
AI in higher education.
Concerns about academic dishonesty in student writing long predate AI, and not all plagiarism
is intentional (Levine & Pazdernik, 2018). By some estimates, over 90% of students self-report
that they have engaged in a dishonest academic behavior at least one time during their college
career (Hard et al., 2006), though most students report that they have done so only rarely and
think that other students are more likely to plagiarize than they are (Fish & Hura, 2013).
Similarly, faculty agree that plagiarism is a problem, though they may think that it is more of a
problem nationwide than in their courses (Liddell & Fong, 2005). Software like Turnitin has
been beneficial to faculty and universities in preventing instances of plagiarism; however, such
technology works by comparing a submitted text to text that is available on the web or in a
database of writing samples (e.g., assignments submitted by previous students in the course).
Such software cannot detect plagiarism in papers with original text written by a ghost writer
other than the student, whether that ghost writer be the student's parent, a friend, a paid
writer, or an AI chatbot.

One such chatbot, ChatGPT-3, was released by the company OpenAI in November of
2022, garnering significant media attention. ChatGPT can generate human-like text about a wide
variety of topics, as well as learn from and incorporate information given in a particular
conversation (Aydın & Karaarslan, 2023). If you ask ChatGPT-3 what it has the capability to do,
it will tell you, "I am capable of a wide range of natural language processing tasks, such as
language translation, text summarization, question answering, and text generation," among
other things.
Within weeks of the public release, concerns regarding ChatGPT’s potential use by
college students emerged, with Stephen Marche claiming in The Atlantic that “the college essay
is dead” (Marche, 2022). In the months that followed, the issue has been a frequent topic of
discussion in many academic communities (e.g., Alby, 2023; McMurtrie, 2023; Mintz, 2023),
though there is disagreement about how concerned faculty should really be. For example, some
have claimed that current limitations of the program, including its inability to cite its sources
and its tendency to include information that is factually incorrect, will still demand that a
student attempting to use the program apply critical thinking to what it produces if they want
to submit quality work.
It is thus a pressing issue whether professors can detect writing as having been
produced by AI rather than by a student. A few studies have examined whether AI can escape
detection in domains other than the college classroom (Clerwall, 2014; Kobis & Mossink, 2021).
For example, Clerwall (2014) presented participants with a news article written by a journalist
and another that was AI-generated. Clerwall found that participants perceived the two texts
similarly and could not reliably discern between the human-generated and AI-generated texts.
AI has also fooled humans with its ability to write more creative texts. Kobis and Mossink
(2021) found that participants were unable to reliably detect which of two poems had been
generated by AI, regardless of whether the human-generated poems had been written by
professional poets or by novices and regardless of whether the participants were incentivized
to make a correct guess or not. However, Kobis and Mossink did find that the AI-generated
poems most likely to pass as human-generated were those that had been selected by humans;
when random poems generated by the AI were tested against human-generated poems, the AI
poems did not fare as well. It seems that AI can fool humans with its ability to mimic human-like
text in both analytical and creative domains, especially when a human has been involved in
selecting which AI-generated texts are best, suggesting that a student who applies some
judgment in selecting AI-generated text may be especially likely to escape detection.

In addition to assessing whether faculty and students can distinguish AI-generated texts from
student-written texts, we also aim to assess the current state of opinion among faculty and
students about the appropriateness of using AI to assist with college-level writing. One
possibility that has been raised is that faculty will need to
rethink the goals of writing assignments—perhaps by thinking about how critical thinking and
writing skills can still be developed and displayed even with the use of AI to do some of the
student’s drafting (Grobe, 2023). Is this a direction that college faculty as a whole are willing to
consider, or is there a general consensus that there is no place for AI in college writing?
Some research suggests that people may have a general aversion to the use of AI in
writing (e.g., Waddell, 2018). Humans tend to prefer texts written by humans over texts written
by computers. Graefe et al. (2018) varied whether participants were told that a news article
had been written by a human journalist or by a computer;
participants found the computer-written articles to have more credibility and expertise, but less
readability.
This tendency to prefer texts written by humans over texts written by computers is
related to a more general phenomenon called algorithm aversion (Burton et al., 2019), in which humans
show unconscious and conscious reluctance to rely on the decisions made by algorithms
compared to a human agent, even though algorithms outperform humans in many domains
(Castelo & Ward, 2021). Algorithm aversion is found in a wide variety of contexts from the
health care system (Heatherly et al., 2023) to college admissions (Wenzelburger & Hartmann,
2022). Research shows that trust in algorithms can be increased by giving people a small degree
of control over the algorithm's output (Castelo & Ward, 2021). If people can incorporate their
own input and have a say in the ultimate decision, their aversion weakens (Dietvorst et al.,
2014). Thus, an open question is whether people might be more open to AI-generated texts if
they see a human as the one still primarily in control of the writing process. For example, in a
college setting, a student may be able to use AI to make suggestions for the organization,
wording, or even ideas to be expressed in a paper, while still taking ownership of which
suggestions to implement in the final draft and making sure the ideas are well-supported with
evidence.
Indeed, in at least some situations, college students and instructors have reported that
technology can be a useful learning and writing tool. For example, writers are expected to use
an automated spell-checker on their work, and persistent spelling errors that suggest the
author did not do so lead to negative perceptions of the author’s abilities (Figueredo &
Varnhagen, 2005). Chang et al. (2021) found that students learning English as a foreign
language who used the program Grammarly over the course of a semester showed larger
improvements in their English writing than students who did not use such a program, and the
students reported appreciating the instant grammar correction that the program provided.
Positive feelings toward Grammarly are shared by students regardless of their level of English
proficiency (Fahmi & Cahyono, 2021), and students find AI grammar correction to be a useful
source of feedback when their professor is not present (Sumakul et al., 2021). Nonetheless,
there is perhaps a difference between programs such as spell-checkers and Grammarly that
merely flag errors with suggested changes and AI programs such as ChatGPT that can compose
entire sentences and paragraphs for a student. Indeed, Keles and Aydin (2021) found that
university students in their sample generally held negative perceptions of artificial intelligence.
The willingness of students and faculty to accept AI as a tool in the writing process may
require balancing a recognition of its potential against a general sense of fear. Kim and Kim
(2022) found that after using an AI-enhanced scaffolding system (AISS) to aid scientific writing,
most teachers showed positive reactions as they recognized its ability to provide strong writing
examples, personalized feedback, and suggestions for supporting sources that could advance
students’ self-guided learning and problem-solving skills. However, the teachers also expressed
hesitation about adopting AISS, as they were concerned that it could make their own role in the
classroom obsolete. A similar sentiment was found by Wood et al. (2021), in which participants
recognized that AI could both
revolutionize medical practice and improve facets of healthcare, while also expressing
significant concerns about physicians or other medical specialists being replaced by
AI in the future. These findings suggest that while there is growing acceptance of AI’s
capabilities, there is also hesitation regarding what changes an AI-integrated future could bring.
The current study had two aims. First, we aimed to assess whether an AI-generated text
can be detected by students and faculty and to examine whether there are specific ways in
which an AI-generated text differs from student-generated texts. Participants were given four
written responses and asked to rate them based on various categories. They were unaware that
one response had been generated by ChatGPT-3. Then, participants were informed of the true
purpose of the study and asked to choose the response they believed was generated by AI, to
report their confidence, and to offer a rationale for their choice. We compared how the AI-
generated sample was rated compared to the student-generated samples and considered
whether focusing on particular features of the texts increased the probability of correctly
choosing the AI-generated text. Second, we aimed to assess current perceptions of students
and faculty regarding the use of AI in college-level writing. Toward this aim, students and
faculty considered nine scenarios that involved a student using technology to assist with college
writing, some of which are in common usage (e.g., spellcheck) and some of which involve using
ChatGPT to help with the writing process in various ways. Students and faculty rated how
ethical they found each scenario as well as how much they thought learning was compromised
in each scenario. This study was largely exploratory; rather than testing any specific hypotheses
about student and faculty perceptions of AI, we sought to examine the current state of thinking
on these issues.
Method
Participants
The final sample included data from 165 participants (82 faculty and 83 students).
Additional data from two students were collected but discarded because they spent fewer than
5 minutes on the survey, chose the same point on the scale for all rating tasks, and did not
write anything for the open-ended responses. The final student sample consisted of 83
participants (36 men, 44 women, and 3 nonbinary individuals) with an average age of 19 years.
The majority of students were white (67%), with Hispanic/Latinx (10%), Asian (8.4%), African
American (5%), mixed race (5%), and Middle Eastern (2%) ethnicities also represented. All
students were enrolled at a small liberal arts college in the Midwest, and they were recruited
through announcements in their psychology courses, as well as via word of mouth and social
media posts. Students who were enrolled in a Psychology course were incentivized with extra
credit in that course.
The faculty sample consisted of 25 men, 52 women, and 1 genderqueer individual, with an
average age of 41 years. Of the faculty who reported their ethnicity, the majority were white
(82%), with Asian (9%), mixed race (4%), Hispanic/Latinx (3%), and African American (2%) also
represented. Faculty were recruited via email invitation and offered a $10 gift card as an
incentive for completing the survey. Email invitations were sent to faculty across campus at a
small liberal arts institution and were also sent to professional contacts and acquaintances of
the principal investigator at a range of institutions. Further, each email invitation included an
appeal for the faculty member to share with others at their institution or in their professional
network who they thought might be interested. The faculty participants reported that they had
been teaching in higher education for an average of 12 years (SD = 9.41). The majority (79.27%)
identified their primary discipline as Psychology, though faculty from other Social Sciences
(4.88%), Natural Sciences (3.66%), Humanities (6.10%), Foreign Languages (4.88%), and Fine
Arts (1.22%) were also represented. Faculty participants taught at a variety of institutions, with
30.49% teaching at colleges offering bachelor's degrees only, 15.85% at institutions offering
bachelor's and master's degrees, and 53.66% at institutions also offering doctoral degrees.
Materials
Writing Samples
When choosing a topic for our writing samples, we considered several factors. First,
because the goal was to collect data from faculty with expertise in a range of disciplines as well
as students, we wanted a topic for our writing samples that would be comprehensible to a wide
audience without specialized expertise in any particular area. Second, we wanted a topic that
could be addressed in relatively few words so that participants could read and rate four writing
samples without committing more than about 15 minutes to the study. Third, we wanted a
topic that involved providing personal examples of a concept; while previous studies have
examined AI’s capabilities with fact-based writing such as journalism (e.g., Clerwall, 2014) and
with creative writing such as poetry (e.g., Kobis & Mossink, 2021), we know of no studies that
have examined its capabilities with topics that involve personal reflection. Reflective writing has
been shown to be an effective pedagogical tool (McGuire et al., 2009) that emphasizes students'
personal engagement with course material (… et al., 2001). Finally, because ChatGPT is known to have difficulties producing accurate citations
(Grobe, 2023), we intentionally selected a topic that could be addressed adequately without
any citations to avoid providing an obvious clue as to which sample was AI-generated.
Toward these goals, we chose the following prompt: “Think about how your personality
affects your study habits. Specifically, does being high or low on a particular personality
dimension affect how likely you are to engage in active recall when you are studying? Be sure to
explain these concepts and provide examples from your life.” We first had six students in an
upper-level Psychology course respond to this prompt, and we chose three of the responses to
use as our student-generated responses. In choosing which student responses to use, we aimed
to have a range of writing quality represented. The student samples were each approximately
200 words and can be seen in their entirety in Appendix A.

To generate the AI response, we gave ChatGPT-3 the prompt along with instructions to
"respond as a college student and use 200 words." We
did this on six different computers, resulting in six different AI-generated responses. The
authors of the study then discussed the six responses and chose the one that we thought
sounded most representative of a college student. In this way, we followed Kobis and Mossink
(2021) in that we chose the AI sample to use based on a human decision of which one was
“best,” rather than choosing randomly. Although this is likely to increase the chances that the
AI-generated text can pass as student writing without detection, it is also likely similar to what a
savvy student would do who is trying to pass AI-generated text off as their own writing.

Below each of the four writing samples, participants rated their level of agreement with
five statements regarding the quality of the work. These ratings were on a five-point scale
(1=strongly disagree, 2=disagree, 3=neither agree nor disagree, 4=agree, 5=strongly agree). All
the statements were positively worded; higher scores on the statements corresponded to more
positive feelings about the sample. The statements regarded perceptions of the writing in terms
of its grammar and mechanics (e.g., "the student demonstrated correct grammar and writing
mechanics”), organization and flow, and amount of time and effort reflected. We also included
two statements specific to the prompt that addressed the quality of the personal experiences
included (e.g., "the student provided and connected the concepts to personal experience from
their life”) and the quality of the connections made between personality and active recall.
Participants were presented with a survey about the acceptable use of technology in
college writing. Nine scenarios describing a student using technology to assist with writing in a
variety of ways were created for the purposes of this study (see Table 1). We generated these
scenarios to represent some situations which we thought most people would be familiar with
and likely see as acceptable (e.g., using spell-check to flag typos in a paper) and some situations
which we thought most people would consider clear plagiarism violations (e.g., copying content
from Google with no citation). In addition, five of the scenarios described a potential way in
which AI (e.g., ChatGPT) could be used to assist with writing. These scenarios ranged in severity
from using the AI to generate an outline or section of a paper that is then developed or
integrated into the rest of the student's own work, to using the AI to generate an entire paper
that the student submits as their own.
Participants answered two questions about each scenario. The first question was How
ethical is this use of technology? Participants responded on a 4-point scale from 1 = Completely
Unethical to 4 = Completely Ethical. A higher score indicates greater acceptance of the use. The
second question was How much does using technology in this way compromise what the
student learned from the assignment? Participants responded on a 4-point scale, where a
higher score indicates the belief that the technology use hinders learning.
Table 1
1. A student uses a spell checker to flag spelling mistakes and typos in their essay.
2. A student uses Grammarly to review the style and clarity of their essay and to suggest
edits.
3. A student uses a citation generator to make a reference list of the sources they cited
in their paper.
4. A student uses Google to look up a topic and then copies and pastes the answer they
find into their paper without citing the source.
5. A student uses an Artificial Intelligence website (e.g., ChatGPT) to write a section of a
paper. They copy and paste this text into their paper, integrating it with their own
writing.
The faculty participants rated their agreement with each of six statements about their
concern that AI would be used by students and whether they intend to change their teaching
practices as a result. The statements can be seen in Table 2. Faculty indicated their agreement
with each statement on a 5-point scale of strongly disagree (1) to strongly agree (5). After they
rated each statement, the faculty participants were given an opportunity to type an open-
ended response to the prompt "Is there anything else you would like to tell us about your
concerns?"
Procedure
Participants began the study by following a link to a Qualtrics survey. They first read a
short description of the study that described it as being about how college students and
instructors think about what constitutes "good" writing. They then gave informed consent and
certified that they were eligible to participate. Next, participants read a description of a
class in which students had previously learned about the concept of "active recall," defined as
bringing an idea to the forefront of one's thinking. The students in the course are now learning
about the Big 5 Personality traits and how they can affect people's lives and behavior. The
instructor of the course has asked students to prepare a brief reflection connecting these ideas
to one another. The exact wording of the prompt was then given, and participants were told
that they would see four student responses to the prompt on the following pages, each of
which they should consider carefully for how well it addressed the prompt. The participant clicked
Next to proceed.
On the four pages that followed, the prompt was repeated at the top of the screen
followed by one of the four writing samples. Underneath the writing sample, the five
statements about writing quality were given and participants indicated their agreement with
each. Participants clicked next when they were ready to submit their answers and view the next
writing sample. The four writing samples were presented in a random order, and participants
were not able to go back and view a previous sample or their responses.
After the participants read and rated the quality of each writing sample, they were told
that “three of the four responses you just read were written by real students and one was
written by an AI language machine, ChatGPT.” The use and function of ChatGPT was described,
and then the real purpose of the study was revealed: to understand students’ and faculty’s
ability to detect AI in student writing and gauge their beliefs around the acceptable use of
technology in college writing. On the following page, the prompt was presented again along
with the four writing samples in a random order. The participants selected which of the four
samples they believed was most likely generated by ChatGPT. The participants then rated their
confidence on a 4-point scale (1 = Not at all confident, 2 = somewhat confident, 3 = confident, and
4 = extremely confident), and were given the opportunity to type a description of their
reasoning.
Next, participants were presented with the survey about acceptable use of technology.
Participants were instructed to imagine that each described use of technology was occurring for
a required, graded assignment in the student’s college course. Each scenario was presented
individually on a page with the ethicality and learning questions underneath. The nine scenarios
were presented in the same fixed order (as shown in Table 1), with the familiar scenarios (spell
check, Grammarly, citation generator, and Google) being presented before the scenarios
involving ChatGPT. Participants clicked next after indicating their choices for each scenario.

Faculty participants then rated their agreement with the six statements regarding their
concern about AI being used by students and whether it would affect their teaching and
grading practices. They were also given an opportunity to provide any additional information
regarding the potential effects of AI on their teaching practices.

Finally, a demographic questionnaire was
given to the faculty and student participants. All participants were asked their gender, age,
ethnicity, whether they were familiar with ChatGPT, whether they had used ChatGPT, whether
they were familiar with the concept of “active recall”, and whether they were familiar with the
concept of “Big 5 Personality traits.” Faculty were also asked about their academic discipline,
the type of institution where they taught, and their years of teaching experience. At the
conclusion of the study, participants were told which of the writing samples had been
generated by ChatGPT and given an opportunity to follow a link to a separate form where they
could enter their contact information to receive extra credit or gift card compensation.
Coding
We coded the qualitative responses about why each participant selected a certain
passage as being AI-generated. Each response was coded for whether it included each of seven
types of reasons. First, a response was coded as mentioning Personalization if the participant
stated that the sample they chose did not contain personal examples or that the examples
included were somehow less specific or less personal than in the other samples. Responses
were coded as Structure if the participant mentioned that the way the response was divided
into paragraphs or sections was an indicator. Organization was assigned to responses that
mentioned that the way the ideas in the sample were organized was somehow relevant. For
example, some participants described that the passage they chose felt like it had been written
“from a template”, that it sounded choppy or abrupt, that it started with definitions before
getting into specific details, or that it had irregular or poor flow of ideas. A response was coded
as Tone if the participant mentioned that the sample sounded robotic or had no human
personality to it, used technical-sounding words, or was repetitive. Word Choice was assigned
to responses that specifically highlighted a particular word or phrase. For example, "as a college
student" was the most common phrase mentioned, and other responses called out specific words
like "in times" and "level-headed" as seeming strange. Responses coded as Grammar either
cited that the chosen sample had notable grammatical errors, or alternatively that it was
completely free of grammatical errors. And finally, responses assigned to Other had other
reasons for selecting the passage that did not fit with any of the predetermined categories
(e.g., responses from participants who selected a passage randomly or could not cite a specific
reason for selecting it). These categories were not mutually exclusive; many responses cited
several different reasons for their selection and had multiple categories assigned to them
accordingly. Additionally, we took note of whether each participant seemed to believe the AI-
generated passage would be the best or the worst of the samples, with an additional category
for responses in which this was unclear.
Results
About half of the student participants (53%) had heard of ChatGPT prior to the start of
this study, though only 17% reported that they had used it themselves and only one was aware
that it was the focus of the present study. We found that 69% of student participants were
familiar with active recall and 49% were familiar with the big five personality traits. By contrast,
most faculty participants (94%) had heard of ChatGPT prior to this study. Only 24% reported
that they had used it before, while 13 participants reported knowing ahead of time that
the study was about ChatGPT. Ninety percent of faculty participants were familiar with the
concept of active recall, and 87% were familiar with the big five personality traits.
We first examined whether participants perceived the quality of the ChatGPT-written
sample to be different from the quality of the three student-written samples. Participants rated
how much they agreed with each of five statements regarding the quality of each writing
sample, where higher scores indicate higher perceived quality of the sample. For each rated
dimension (grammar and mechanics, organization and flow, personal experiences, connections
of the concepts to each other, and effort), we conducted a repeated measures analysis of variance (ANOVA) that
compared the ratings on that dimension across the four writing samples, followed by post hoc
tests to compare each of the student-written samples to the ChatGPT sample. Post hoc tests
applied a Bonferroni correction for multiple comparisons. The results are shown in Figure 1 and
summarized below.
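The post hoc procedure described above (paired comparisons against the ChatGPT sample with a Bonferroni correction) can be sketched in Python. The ratings below are randomly generated stand-ins for the study's data, and all variable names are ours, not the authors'.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 165  # number of participants in the final sample

# Hypothetical 1-5 agreement ratings on one quality dimension;
# stand-ins for the real data, which we do not have
ratings = {
    "chatgpt": rng.integers(3, 6, n).astype(float),
    "student1": rng.integers(2, 6, n).astype(float),
    "student2": rng.integers(3, 6, n).astype(float),
    "student3": rng.integers(1, 5, n).astype(float),
}

# Post hoc paired t-tests comparing each student sample to the ChatGPT
# sample, applying a Bonferroni correction for the three comparisons
n_comparisons = 3
for name in ("student1", "student2", "student3"):
    t, p = stats.ttest_rel(ratings["chatgpt"], ratings[name])
    p_adj = min(p * n_comparisons, 1.0)  # Bonferroni-adjusted p value
    print(f"ChatGPT vs. {name}: t({n - 1}) = {t:.2f}, adjusted p = {p_adj:.3f}")
```

The omnibus repeated-measures ANOVA itself could be fit with a dedicated routine (e.g., statsmodels' `AnovaRM`); the sketch above only illustrates the Bonferroni-corrected follow-up comparisons.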
Participants rated the grammar and writing mechanics of the four samples differently,
F(3, 492) = 46.85, p < .001, ηp² = .22. The sample written by ChatGPT was rated as having
significantly better grammar and mechanics than student Samples 2, t(164) = 3.01, p = .016, and 3,
t(164) = 11.04, p < .001. The organization and flow of the samples was also rated differently,
F(3, 489) = 16.01, p < .001, ηp² = .089, with the ChatGPT sample being rated as having better
organization and flow than student Sample 3, t(163) = 6.55, p < .001. Participants rated the
quality of the personal experiences provided in the four samples differently, F(3, 492) = 17.861,
p < .001, ηp² = .098. The sample written by ChatGPT was rated as having significantly better
personal experiences than student Sample 1, t(164) = 3.58, p = .002, but significantly worse
personal experiences than Sample 2, t(164) = 3.72, p = .001. How
well the samples connected the idea of active recall to the Big 5 personality traits was also
rated differently across the samples, F(3, 486) = 6.409, p < .001, ηp² = .038, with ChatGPT rated
as having better connections in the text than student Sample 1, t(162) = 3.62, p = .002.
Participants rated the time and effort put into writing each of the four samples differently as
well, F(3, 492) = 9.10, p < .001, ηp² = .053. The sample written by ChatGPT was rated as having
significantly more effort than student Sample 3, t(164) = 5.08, p = .001. Although the ChatGPT
sample was rated differently from some student samples on some dimensions, it was never
rated differently from all three of the student samples. This suggests that ChatGPT writing can
effectively blend in with a set of student-generated writing samples; it is neither the worst nor
the best of the set.
Figure 1
Note. Participants’ average ratings for each of the five statements regarding the quality of the
writing samples. Higher scores represent higher perceived quality for the writing sample on
that dimension. Error bars represent standard errors of the means. Student samples that differ
significantly from the ChatGPT sample are marked with *.
Faculty and students did not differ in the frequency with which they chose each of the
four samples as being AI-generated, two-way χ²(3, N = 165) = 3.62, p = .31. However, for both
faculty and students, the four samples were not chosen with equal frequencies, one-way χ²(3,
N = 165) = 30.71, p < .001. Student Sample 2 was rarely chosen as the AI-generated text (7%),
while Student Sample 1 was chosen the most often (36%) and Student Sample 3 was chosen at
about chance levels (27%). Importantly, the ChatGPT sample was chosen as the AI-generated
text by only 29% of participants, suggesting that a majority of participants were unable to
identify it correctly.
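A goodness-of-fit analysis of this kind can be sketched with SciPy. The counts below are hypothetical reconstructions from the reported percentages, not the study's raw data, and the two-way table is entirely illustrative.

```python
from scipy.stats import chisquare, chi2_contingency

# Hypothetical counts of which sample each of the 165 participants chose
# as AI-generated, reconstructed from the reported percentages
# (36%, 7%, 29%, 27%); not the study's raw data
observed = [59, 12, 48, 46]  # Sample 1, Sample 2, ChatGPT, Sample 3

# One-way (goodness-of-fit) chi-square against the chance expectation
# that each of the four samples is chosen 25% of the time
stat, p = chisquare(observed)
print(f"chi2(3, N = {sum(observed)}) = {stat:.2f}, p = {p:.4f}")

# The two-way test crosses choice with group (faculty vs. student);
# this 2 x 4 table of counts is likewise hypothetical
table = [[30, 5, 25, 22], [29, 7, 23, 24]]
stat2, p2, dof, expected = chi2_contingency(table)
```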
We next considered whether prior experience using ChatGPT improved the likelihood of
detecting the correct AI-generated text. Those who reported using ChatGPT prior to this study
were more likely to successfully detect the AI-written sample than those who reported no prior
experience, though the two-way chi-square was not significant by conventional standards, χ²(1,
N = 163) = 2.85, p = .09. This trend was driven by the faculty, with 45.00% of faculty who had
used ChatGPT choosing the correct sample compared to only 24.19% of those who had no
experience with ChatGPT choosing it, χ²(1, N = 82) = 3.16, p = .08. Further, faculty members
whose primary discipline was Psychology were more likely to detect the AI sample if they had
experience using ChatGPT than if they did not, χ²(1, N = 65) = 4.76, p = .03. A narrow majority
(54%) of Psychology faculty who had used ChatGPT in the past were able to identify the
ChatGPT sample as being AI-written, compared to 23% of Psychology faculty with no experience
using it. In our relatively small sample, experience using ChatGPT almost doubled the likelihood
that a faculty member would be able to successfully choose the ChatGPT-written sample from
among the four choices, and experience seems to have had a particular benefit for judging samples
within the faculty member's own discipline.
Were participants who correctly identified the AI-generated text more confident in their
choice than those who did not? A 2 (Sample: Student vs. Faculty) x 2 (Correct: No vs. Yes) between-subjects ANOVA was conducted on the reported confidence in the choice. There was no difference in confidence depending on whether the participant chose the correct sample, F(1, 161) = 0.011, p = .92, ηp² = 6.96 × 10⁻⁵. However, there was a difference between the confidence of students and faculty, F(1, 161) = 6.99, p = .009, ηp² = .042. Students were more confident in their choice (M = 2.06, SD = 0.62) than faculty (M = 1.76, SD = 0.72). There was no interaction between accuracy and sample, F(1, 161) = 0.104, p = .633, ηp² = .001.
Confidence was also unrelated to whether participants had used ChatGPT before the study;
participants who had used ChatGPT were not more confident in their choice (M = 1.97, SD =
0.80) than participants who had never used ChatGPT before (M = 1.90, SD = 0.66), t(161) = .54,
p = 0.59, d = 0.10.
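The partial eta squared values reported for these analyses follow directly from each F ratio and its degrees of freedom; a minimal sketch of that identity:

```python
# Partial eta squared recovered from a reported F ratio:
#   eta_p^2 = (F * df_effect) / (F * df_effect + df_error)
def partial_eta_squared(f, df_effect, df_error):
    return (f * df_effect) / (f * df_effect + df_error)

# Sample (student vs. faculty) main effect on confidence: F(1, 161) = 6.99
print(round(partial_eta_squared(6.99, 1, 161), 3))  # 0.042, matching the reported value
```

The same identity recovers the other reported effect sizes, e.g. F(8, 1264) = 446.60 yields ηp² ≈ .74.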
The most commonly cited reasons for choosing a particular sample as the AI-generated sample were Tone (49.09%), Organization (41.21%), and Personalization (36.97%). The least cited reasons were Word Choice (9.70%), Structure (9.09%), and Grammar (6.67%). This pattern generally held across both students and faculty, although faculty cited tone more than students (59.76% vs. 39.68%), whereas students cited organization (47.62%) and personalization (46.03%) more often than faculty did. We then tested whether citing each factor was associated with choosing correctly with a two-way chi-square for each factor. The results are shown in Figure 2. Using Organization as a factor for one's choice significantly increased the chances of choosing correctly, X²(1, N = 165) = 6.32, p = .012. Of the participants who stated organization as a reason for selecting a particular sample as the AI sample, 39.71% made the correct choice (compared to 29% in the overall sample). Using Structure as a factor also seemed to help participants make a correct choice, X²(1, N = 165) = 4.70, p = .03. Participants who cited structure as a reason for selecting a particular sample were correct 53.33% of the time, though this finding should be interpreted with some caution given the relatively low number of people who cited this factor.
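The two-way test for the Organization factor can likewise be reproduced from cell counts inferred from the reported percentages; the counts below are our inference, not the raw data:

```python
# 2x2 chi-square (no continuity correction) testing whether citing Organization
# predicted a correct choice. Counts are inferred from the reported percentages
# (41.21% of N = 165 cited organization; 39.71% of those were correct,
# vs. 29% correct overall).
#            correct  incorrect
table = [[27, 41],   # cited organization
         [21, 76]]   # did not cite organization

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (table[i][j] - expected) ** 2 / expected

print(f"X^2(1, N = {n}) = {chi_sq:.2f}")  # X^2(1, N = 165) = 6.32
```

With these inferred cells the statistic matches the reported 6.32, which suggests the reported test was computed without a continuity correction.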
Figure 2
Note. Ratio of correct to incorrect responses among participants who mentioned each factor in their reasoning for selecting a sample as AI-generated. Factors with statistically significant effects are marked with *.
Ethics. Figure 3 shows student and faculty perceptions of how ethical each of the nine activities is. We analyzed these ratings with a 2 (Sample: Student vs. Faculty) x 9 (Activity) repeated measures ANOVA. We found a significant effect of activity, F(8, 1264) = 446.60, p < .001, ηp² = .74, and a significant activity x sample interaction, F(8, 1264) = 2.23, p = .02, ηp² = .01, but no main effect of sample, F(1, 158) = 0.0008, p = .978, ηp² < .001.
Students and faculty differed only in terms of how ethical they found the use of spellcheck,
t(163) = 2.81, p = .01, d = .39, with faculty generally finding spell check more ethically
acceptable than students, though it should be noted that both faculty and students found spell
check to be among the most ethically acceptable of the nine scenarios. As expected,
participants found the three uses of technology that are in common usage (spell check, citation
generator, and Grammarly) to be highly ethical, while finding the use of Google to copy and
paste highly unethical. Of the potential ways to use ChatGPT, only using it to write an entire paper without citing it was rated as being as unethical as copying and pasting from Google, t(159) = 1.13, p > .05. The other four potential uses of ChatGPT were rated as more ethical than copying from Google (all t's > 5.0, p < .001) but less ethical than spell check, Grammarly, and citation generators (all t's > 14.0, p < .001). Further, these four potential uses of ChatGPT were not all
rated the same; using AI to make an outline for a paper and then writing the paper oneself was
rated as similarly ethical to using AI to write an entire paper but citing the use of the AI in the
paper, t(163) = 0.47, p = .64. Both of these potential uses were rated as more ethical than using
AI to expand an outline or draft (t(163) = 7.57, p < .001 and t(164) = 6.48, p < .001), which was
in turn rated as more ethical than using AI to write a section of a paper that is then
incorporated with the student’s own writing (t(164) = 8.03, p < .001). It seems that both
students and faculty find uses of ChatGPT that pass its output off as the student’s own writing
to be more unethical than uses that either admit to its involvement or that use it only to
generate ideas or outlines without composing the text for the student.
Figure 3
Faculty and Student Perceptions of How Ethical It Is to Use Technology to Assist with Writing
Note. Error bars represent standard errors of the means. Significant differences between
students and faculty are noted with *. For differences in ratings of the various activities, see the
text.
Learning. Figure 4 displays faculty and student perceptions of how much each potential use of technology compromises learning. The pattern closely mirrors the ethics ratings, in that activities that were rated as less ethical were rated as more detrimental to learning. We analyzed perceptions of learning with a 2 (Sample: Student vs. Faculty) x 9 (Activity) repeated measures ANOVA. There was a significant effect of activity, F(8, 1264) = 193.24, p < .001, ηp² = .55, and a significant activity x sample interaction, F(8, 1264) = 5.06, p < .001, ηp² = .03, but no main effect of sample, F(1, 158) = 0.45, p = .50, ηp² = .003. Faculty and students differed in their perceptions of spellcheck, with faculty finding it less detrimental to a student's learning than students did, t(163) = 2.57, p = .01. In contrast, faculty found copying and pasting from Google to be more detrimental to students' learning than students did, t(161) = 2.77, p = .006. Nonetheless, for both faculty and students, spellcheck was among the scenarios thought to compromise learning the least, while copying from Google was among those thought to compromise it the most.
The three uses of technology in common usage (spellcheck, Grammarly, and citation
generator) were thought to compromise learning less than any of the other activities, all t’s >
6.0, p < .001. Using AI to write an entire paper was thought to be the most compromising to
learning, even more compromising than copying and pasting from Google, t(161) = 4.97, p
<.001. Using AI to write a section of a paper and incorporating it with one’s own writing was
seen as compromising learning to a similar degree as using Google to copy and paste, t(162) =
2.0, p > .05. Interestingly, using AI to write a paper but citing the use of AI was seen as less compromising to learning than using it, without citation, to write an entire paper or a section of a paper, and was thought to compromise learning to the same degree as using AI to expand an outline or draft. Finally, using AI to make an outline for a paper that is then written by the student was
seen as more likely to compromise learning than the technology currently in use (Grammarly,
spellcheck, and citation helpers) but was seen as less likely to compromise learning than any of
the other potential uses for AI (all t’s > 6.0, p < .001). It appears that faculty and students
generally think that using AI to assist with writing is very likely to compromise learning if that AI
is going to be used to write text for the student; using it only to generate ideas is seen as less
problematic. One thing to note about these analyses, however, is that even for the most accepted technology uses (spellcheck, Grammarly, citation generators), the means were well above 1; in contrast to the ethics analysis, where some uses were rated as "definitely ethical," it seems that both faculty and students were generally unwilling to say that learning is "definitely not" compromised by any of these technologies.
Figure 4
Faculty and Student Perceptions of How Much Learning Is Compromised by Technology-Assisted Writing
Note. Error bars represent standard errors of the means. Significant differences between
students and faculty are noted with *. For differences in ratings of the various activities, see the
text.
Table 2 shows faculty’s average agreement with each of the six statements regarding
their level of concern for AI affecting their teaching. The average agreement with each
statement was compared to the midpoint of the scale (3) with a one-sample t-test. Faculty agreed that they were concerned about students using AI for writing assignments and that the availability of AI will change the types of assignments they give and how they assess writing assignments. Faculty were generally not confident in their ability to detect the use of AI in a student's writing, and they strongly agreed that it is important to talk with students about the acceptable use of AI in their courses.
Table 2
Faculty Concerns on the Use of AI in the Classroom
Note. Higher scores indicate stronger agreement with each statement. One-sample t-tests are
compared to the mid-point of the scale (3.0) and * indicates p < .001.
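The one-sample comparison described in the note can be sketched as follows; the ratings below are hypothetical illustration values, not data from Table 2:

```python
import math

# One-sample t-test against the scale midpoint (3.0), as used for Table 2.
# The ratings below are hypothetical illustration values, not the study's data.
ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]
midpoint = 3.0

n = len(ratings)
mean = sum(ratings) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in ratings) / (n - 1))  # sample SD
t = (mean - midpoint) / (sd / math.sqrt(n))                      # one-sample t
print(f"t({n - 1}) = {t:.2f}")  # t(9) = 4.71
```

A significant positive t indicates average agreement reliably above the neutral midpoint.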
Faculty were not in agreement about the potential for AI to be used as a tool for struggling writers. Interestingly, this item had the largest standard deviation of the six items, suggesting that it is not that most faculty felt neutral about it; rather, there was disagreement among the faculty, with 36% agreeing or strongly agreeing that AI could be used as a tool to help struggling writers, but 27% disagreeing or strongly disagreeing that AI could or should be used in this way. Notably, whether a faculty member agreed or disagreed with this statement was predicted by whether they themselves had prior experience with ChatGPT; 70% of faculty with prior experience agreed or strongly agreed that AI could be used as a tool for struggling writers compared to only 26% of faculty with no prior experience, X²(4, N =
Discussion
The first goal of the present study was to determine whether students and faculty can
detect an AI-written text when it is presented among a set of student-generated texts. Like past
studies testing humans’ ability to detect AI-generated text in news stories (Clerwall, 2014) or
poetry (Kobis & Mossink, 2021), we found that neither faculty nor students could identify the
AI-generated writing at above chance levels. This suggests that ChatGPT-3 is capable of
producing writing that blends in with real student writing, even when the focus of the writing is
on applying course concepts to one’s own life. Of course, it should be noted that we
intentionally chose the “best” response obtained by ChatGPT from six that it produced; had we
chosen the first response it produced, a random response, or the “worst” response instead of
the “best,” it is quite possible that the ChatGPT sample would have been more detectable, as
Kobis and Mossink found to be the case when they compared a random AI-generated poem to
human-generated poems. Nonetheless, the present results show that it is possible for the technology to produce writing that goes undetected.
Indeed, the ChatGPT sample did not differ from all three student samples on any of the rated dimensions (grammar, organization, quality of personal examples, and apparent effort). Overall, the ChatGPT writing sample was perceived to be among the best of the samples at
grammar, organization, and apparent effort. Although the ChatGPT sample was rated lower
than one of the student samples in terms of the quality of its personal examples, it did as well
as or even better than the other two student samples on this dimension. Thus, when it comes to identifying a writing sample as being written by AI, there is no single dimension that can be considered a definite giveaway. We suspect that this may be even more true if the AI sample
were included in a larger batch of student writing samples, with even greater variation in the
quality of the student writing. However, there may be some promise in considering multiple
dimensions—that is, the present data suggest that a writing sample that is simultaneously
among the best at grammar and organization while not being as good at providing high-quality
personal examples may warrant closer inspection, perhaps by pasting it into an AI-detector
such as GPTZero (https://gptzero.me/). However, while promising, such technology is still not
perfect, as it will flag human-written work as AI-generated text in at least some situations
(Leong, 2023).
Although the sample written by ChatGPT-3 did not emerge as being different from the
student samples in terms of the quantitative ratings on any dimension, it is possible that there
is something qualitatively different about the texts that ChatGPT produces. That is, the
organization might not be overall better, but it might nonetheless be qualitatively different than
the student samples. Some evidence for this possibility comes from the reasons cited in
participants’ rationale about why they chose a particular sample as being AI-generated.
Participants who cited focusing on the organization or structure of the response were more
likely to choose the correct sample as being AI-generated than those who did not cite a focus
on organization or structure. For instance, a successful participant noted that “ChatGPT tends
to answer in paragraphs which rules out two options, additionally from my experience it tends
to start with definitions and then take a deep dive.” This particular organization may not be
better than other ways of organizing the text, but it may be recognizable, particularly by people
who have some experience using ChatGPT. In a simple 200-word response like our writing samples, there is no particular need to divide the text into distinct paragraphs (only one of our six student writers did so), but knowing that ChatGPT tends to do that seemed to help people identify its output.
The second goal of this study was to assess student and faculty perceptions of using AI
to assist with college-level writing. We found that students and faculty had highly similar views
of how ethical various uses of technology were and how much they compromise learning.
Faculty rated spell checker as more ethical and less compromising to learning than students did,
while students thought copying and pasting information from Google was less compromising to
learning than faculty did. These subtle differences aside, students and faculty agree that common uses of technology like spell check, Grammarly, and citation generators are ethically acceptable and less compromising to learning than copying and pasting from Google.
Some uses of AI were seen as highly problematic by both students and faculty. Using
ChatGPT to write an entire paper was considered unethical and even more detrimental to
learning than copying and pasting from Google. However, citing AI in the paper was considered
more ethically acceptable and less compromising to learning. Results indicated that spellcheck, Grammarly, and citation generators were seen as significantly more acceptable than any of the uses of ChatGPT. However, participants' greatest aversion to AI use in writing is when AI
composes text for the student, whether it be the entire paper or a section of the paper, and
regardless of whether it is based on the student’s own ideas. Students and faculty were a little
more accepting of using AI to produce ideas or an outline that is then expanded by the
student’s own hand. This mirrors existing literature showing that people do not like AI-
generated text if they know that it is AI-generated (e.g., Graefe et al., 2018), perhaps as a result of algorithm aversion (Burton et al., 2020). Further, this literature suggests that people are more likely to accept the use of AI if a human also has a hand in the final output or decision (Dietvorst et al., 2015). When it comes to writing, people strongly object to the final text being generated entirely by the AI.
Overall, faculty were significantly concerned about the use of AI in the classroom and
for writing assignments. Faculty reported that they had very low confidence in their ability to
detect AI, and indeed, we found no relationship between an individual faculty member’s
success at detecting the AI in our study and their reported confidence that they had chosen
correctly. Further, faculty are concerned about students using AI for assignments and believe
that AI will lead to a change in the types of assessments they give in their classes. Faculty also
agreed that having conversations with students about the acceptable use of AI in their courses
is important.
Faculty were most divided on the topic of whether AI could be a useful tool for
struggling writers, and this division was in part due to the faculty member’s own experience
using ChatGPT. Faculty members with experience using ChatGPT were more likely to agree or
strongly agree that AI could be a useful tool for struggling writers than faculty members with no
prior experience using ChatGPT. This difference could be due to individual differences in
comfort with technology; faculty who are more open to technology could be both more likely
to have tried it out themselves and more receptive to its use by students. On the other hand,
the difference could be due to experience with the program giving faculty a better sense of its
capabilities and changing how they think about it, as Kim and Kim (2022) have shown that
teachers who have experience with a particular technology become more open to it. More
research is needed to examine how using ChatGPT may affect faculty’s perceptions of it, but in
the meantime, it seems that any conversation among faculty about ChatGPT may be divided
based on whether they have actually used the technology themselves or not.
There are a few limitations of the study that should be noted. First, in order to keep the
length of the study manageable, we compared one ChatGPT sample to three human samples.
As a result, there was not very much variation in the quality of the writing samples that we
used, and the use of only one sample written by ChatGPT perhaps says more about
participants’ inability to detect this particular sample than an inability to detect AI-generated
prose in general. Another limitation is the context of the writing sample; not all the participants
were familiar with the psychology terms (The Big 5 Personality Traits and Active Recall) used in
the samples, and this could affect their ability to detect AI in the written samples. Similarly, we tested detection of a particular type of writing (applied reflection) that is often used in psychology courses; detection may be easier or harder for other genres of academic writing. Additionally, we did not include sources in the student and AI-generated writing samples
because ChatGPT-3 has known problems with including accurate sources (Grobe, 2023). This is
a noteworthy limitation because much academic writing does require citations, and it remains
unclear whether faculty or students would be able to detect problems with citations of the sort
that ChatGPT-3 is known to make. Finally, this study was conducted with ChatGPT-3
approximately two months after its public release; as additional iterations of the technology
(e.g., ChatGPT-4) are released, its capabilities and people’s opinions about it are likely to
change.
In conclusion, the advent of AI chatbots that can produce passable human text is not the
first technological innovation to threaten the academy, nor will it likely be the last. Faculty
generally feel anxious about implementing new and unfamiliar technologies (Zimmerman,
2006), and our data suggest that ChatGPT is no exception, perhaps for good reason: its output is difficult to detect when placed among student writing. However, our data also suggest some potential ways to improve detection, specifically by using ChatGPT oneself with prompts in one's own area of expertise. Familiarity with the organization and structure of the output it produces seems to aid detection. Moreover, faculty and students are not that far apart in terms of how they think about this new technology, making the classroom environment ripe for a productive conversation about expectations surrounding its use. At the very least, the advent of AI technology should be a call to communicate the learning goals of writing assignments clearly to students. Given that students generally agree that using AI will compromise their learning, perhaps the best defense faculty have is to make sure that students see writing assignments as meaningful opportunities to learn.
References
Alby, C. (2023). ChatGPT: A must-see before the semester begins. Faculty Focus.
Anderson, L. W., Krathwohl, D. R., & Bloom, B. S. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. Longman.
Aydın, Ö., & Karaarslan, E. (2023). Is ChatGPT leading generative AI? What is beyond expectations?
Burton, J. W., Stein, M.-K., & Jensen, T. B. (2020). A systematic review of algorithm aversion in augmented decision making. Journal of Behavioral Decision Making, 33(2), 220–239. https://doi.org/10.1002/bdm.2155
Castelo, N., & Ward, A. F. (2021). Conservatism predicts aversion to consequential Artificial Intelligence. PLoS ONE.
Chang, T-S., Li, Y., Huang, H-W., & Whitfield, B. (2021). Exploring EFL students' writing
https://doi.org/10.1145/3459043.3459065
Dietvorst, B. J., Simmons, J. P., & Massey, C. (2015). Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1), 114–126. https://doi.org/10.1037/xge0000033
Fahmi, M. A., & Cahyono, B. Y. (2021). EFL students’ perception on the use of Grammarly and
https://doi.org/10.21070/jees.v6i1.849
Figueredo, L. & Varnhagen, C. K. (2005). Didn't you run the spell checker? Effects of type of
spelling error and use of a spell checker on perceptions of the author. Reading
Fish, R., & Hura, G. (2013). Students' perceptions of plagiarism. Journal of the Scholarship of Teaching and Learning.
Gingerich, K. J., Bugg, J. M., Doe, S. R., Rowland, C. A., Richards, T. L., Tompkins, S. A., &
Graefe, A., Haim, M., Haarmann, B., & Brosius, H.-B. (2018). Readers' perception of computer-generated news: Credibility, expertise, and readability. Journalism, 19(5), 595–610. https://doi.org/10.1177/1464884916641269
Grobe, C. (2023). Why I’m not scared of ChatGPT. The Chronicle of Higher Education.
Hard, S. F., Conway, J. M., & Moran, A. C. (2006). Faculty and college student beliefs about the frequency of student academic misconduct. The Journal of Higher Education, 77(6), 1058–1080. https://doi.org/10.1080/00221546.2006.11778956
Hatherley, J., Sparrow, R., & Howard, M. (2023). The virtues of interpretable medical AI. Cambridge Quarterly of Healthcare Ethics. https://doi.org/10.1017/S0963180122000664
Huang, K. (2023). Alarmed by AI chatbots, universities start revamping how they teach. New
York Times.
Keles, P. U., & Aydin, S. (2021). University students’ perceptions about artificial intelligence.
https://doi.org/10.34293/education.v9iS1-May.4014.
Kellogg, R. T., & Raulerson, B. A. (2007). Improving the writing skills of college students. Psychonomic Bulletin & Review, 14(2), 237–242. https://doi.org/10.3758/BF03194058
Kim, N. J., & Kim, M. K. (2022). Teacher's perceptions of using an artificial intelligence-based educational tool for scientific writing. Frontiers in Education, 7. https://doi.org/10.3389/feduc.2022.755914
Leong, A. (2023). How to detect ChatGPT plagiarism, and why it’s so difficult. Digitaltrends.
https://www.digitaltrends.com/computing/how-to-detect-chatgpt-plagiarism/.
Levine, J., & Pazdernik, V. (2018). Evaluation of a four-prong anti-plagiarism program and the
Liddell, J., & Fong, B. (2005). Faculty perceptions of plagiarism. Journal of College and Character,
6(2). https://doi.org/10.2202/1940-1639.1416.
https://doi.org/10.1177/0098628314549701.
McGuire, L., Lay, K., & Peters, J. (2009). Pedagogy of reflective writing in professional education.
McMurtrie, B. (2023). “Teaching: Will ChatGPT change the way you teach?” The Chronicle of
Higher Education.
National Association of Colleges and Employers. (2022). The attributes employers want to see on college students' resumes.
attributes-employers-want-to-see-on-college-students-resumes/
Schumacher, G. M., & Nash, J. G. (1991). Conceptualizing and measuring knowledge change due to writing. Research in the Teaching of English. www.jstor.org/stable/40171182
Spirgel, A. S., & Delaney, P. F. (2016). Does writing summaries improve memory for text? Educational Psychology Review. https://doi.org/10.1007/s10648-014-9290-2
Sumakul, D. T. Y. G, Hamied, F. A., & Sukyadi, D. (2022). Students’ perceptions of the use of AI in
a writing class. Advances in Social Science, Education and Humanities Research, 624, 52-
57.
Wenzelburger, G., & Hartmann, K. (2022). Policy formation, termination and the multiple streams framework. Policy Studies. https://doi.org/10.1080/01442872.2021.1922661
Zimmerman, J. (2006). Why some teachers resist change and what principals can do about it. NASSP Bulletin, 90(3), 238–249.
Appendix A
Think about how your personality affects your study habits. Specifically, does being high or low
on a particular personality dimension affect how likely you are to engage in active recall when
you are studying? Be sure to explain these concepts and provide examples from your life.
As a college student, I believe my personality does affect my study habits, specifically in how
likely I am to engage in active recall. I would say that I am relatively high in the trait of
conscientiousness. This means that I am organized, reliable, and hardworking, which translates
to my study habits. I find that I am more likely to engage in active recall when studying because
I am naturally inclined to be organized and plan out my study sessions.
Active recall is a study technique where the learner actively retrieves information from their
memory, as opposed to passively re-reading material. An example of active recall for me would
be creating flashcards and quizzing myself on the information, or teaching the material to a
peer. This helps me to retain the information better and I have found that it leads to better
performance on exams.
Overall, I believe that my high level of conscientiousness plays a big role in my tendency to
engage in active recall when studying. It allows me to be more organized and intentional with
my study habits, which leads to better retention and performance.
ability to recall information while studying, in times when I am highly stressed, it has seemed
extremely difficult to remember details of a lesson. In times when I am level-headed, I feel like I
can recall information much better. Whether or not these personality traits directly affect the
ability to recall information is hard to tell, but it can give insight to my study habits.