
Open Information Science 2023; 7: 20220158

Research Article

William H. Walters*

The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors
https://doi.org/10.1515/opis-2022-0158
received August 01, 2023; accepted September 15, 2023

Abstract: This study evaluates the accuracy of 16 publicly available AI text detectors in discriminating between
AI-generated and human-generated writing. The evaluated documents include 42 undergraduate essays generated by ChatGPT-3.5, 42 generated by ChatGPT-4, and 42 written by students in a first-year composition
course without the use of AI. Each detector’s performance was assessed with regard to its overall accuracy, its
accuracy with each type of document, its decisiveness (the relative number of uncertain responses), the
number of false positives (human-generated papers designated as AI by the detector), and the number of
false negatives (AI-generated papers designated as human). Three detectors – Copyleaks, TurnItIn, and
Originality.ai – have high accuracy with all three sets of documents. Although most of the other 13 detectors
can distinguish between GPT-3.5 papers and human-generated papers with reasonably high accuracy, they are
generally ineffective at distinguishing between GPT-4 papers and those written by undergraduate students.
Overall, the detectors that require registration and payment are only slightly more accurate than the others.

Keywords: AI content detector, AI writing detector, artificial intelligence, chatbot, generative AI

1 Introduction

1.1 Generative AI and AI Text Detectors

Despite the great potential of generative artificial intelligence, the use of AI raises problems in situations
where performance goals are meant to signal progress toward learning goals – where the completion of a
written paper, for instance, is valuable not as an end in itself but as a mechanism for helping students learn
how to plan, complete, and edit their written work (Dweck, 1986). Many authors have expressed concern that
students are submitting papers generated by ChatGPT and other AI tools as their own original work, thereby
attaining the performance goal but bypassing the learning goal. This has implications for teaching, learning,
and academic integrity (e.g., Lund et al., 2023; Marche, 2022). Moreover, students’ use of AI is widespread and
likely to increase. In a recent survey of 1,000 US university students, 43% reported that they had used ChatGPT
or a similar AI tool. Twenty-two percent of all respondents had used AI “to help complete [their] assignments
or exams,” and 32% planned to use or continue using AI in their academic work (Welding, 2023). The problem
may have become more serious since the release of ChatGPT-4 in March 2023 (OpenAI, 2023a,c).
* Corresponding author: William H. Walters, Mary Alice & Tom O’Malley Library, Manhattan College, 4513 Manhattan College Parkway, Riverdale, NY 10471, USA, e-mail: william.walters@manhattan.edu

Open Access. © 2023 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.

AI text detectors provide qualitative or quantitative assessments of the likelihood that a particular document was AI generated. They can therefore help instructors determine whether students have used AI to complete their academic work. They can also help students determine whether a particular paper is likely to
trigger allegations of academic misconduct. Many AI detectors work by breaking the text down into tokens
(words or other common sequences of characters) and predicting the probability that a particular token will
be followed by the next in the sequence. The texts most likely to be identified as AI generated are those with
high predictability and low perplexity – those with relatively few of the random elements and idiosyncrasies
that people tend to use in their writing and speech. Some AI text detectors employ other methods (Crothers,
Japkowicz, & Viktor, 2023), but methods based on perplexity and related concepts are most often used by the
detectors available to the general public.
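To make the perplexity approach concrete, the sketch below scores a passage with GPT-2 through the Hugging Face transformers library. It is a minimal illustration of the general technique, not the method of any detector evaluated in this study; the threshold is an invented placeholder that a real detector would calibrate on labeled documents.

```python
# Minimal sketch of perplexity-based AI text detection, assuming the
# `torch` and `transformers` packages are installed and using GPT-2 as
# the scoring model. Illustrative only: not any evaluated detector's method.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the perplexity of `text` under GPT-2 (lower = more predictable)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing the inputs as labels yields the mean next-token cross-entropy.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

THRESHOLD = 25.0  # invented placeholder; real detectors calibrate on labeled data
text = "Artificial intelligence raises new questions for educators and students."
if perplexity(text) < THRESHOLD:
    print("Low perplexity: weak evidence of AI generation.")
```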

1.2 Previous Evaluations of AI Text Detectors

Quite a few websites and blogs claim to evaluate the accuracy of various AI text detectors (e.g., Abdullahi, 2023;
Andrews, 2023; Aw, 2023; Caulfield, 2023; Cemper, 2023; Compilatio.net, 2023; Demers, 2023; Deziel, 2023;
Gewirtz, 2023; Ivanov, 2023; Singh, 2023; van Oijen, 2023; Wiggers, 2023; Winston.ai, 2023). Unfortunately,
each has significant limitations or biases. Fourteen problems can be readily identified:
1. The authors or their sponsors have a clear conflict of interest; they provide AI detection software or accept
extensive advertising from providers.
2. The assessment has a strong subjective component, often conflating accuracy with other factors such as
convenience or ease of use.
3. The assessment uses a number of different procedures that are not applied systematically to every detector.
4. The tests are performed on just a small number of documents.
5. The report does not specify how the documents were generated or acquired.
6. AI-generated text is evaluated but human-generated text is not. The assessment can therefore detect false
negatives but not false positives.
7. The documents evaluated are not typical of those submitted by students undertaking academic work.
8. The human-generated documents are taken from sources (such as websites) that are potentially available
to AIs as sources of training documents.
9. The human-generated documents are written by the investigators themselves. This introduces the potential for conscious or unconscious bias.
10. The assessment does not consider a representative set of detectors. It may exclude the newest or most
widely used detectors, or it may compare one effective detector with several ineffective ones.
11. The assessment includes only those detectors that do not require registration or payment.
12. The report does not specify which versions of the detectors were used, or how they were used (e.g., which
test options were chosen, and whether the software evaluated entire documents or just portions of them).
13. The report does not mention the specific responses provided by the software or how those responses were
coded as AI generated, human generated, or uncertain.
14. The results are presented inconsistently, with detailed results for some detectors or documents but not for
others.

At least one website presents a more careful assessment (Gillham, 2023). Moreover, recent scholarly
investigations have avoided most of the problems mentioned here. Since the release of GPT-3.5, more than
a dozen studies have included evaluations of the English-language AI text detectors currently in general use
(Aremu, 2023; Cingillioglu, 2023; Desaire, Chua, Isom, Jarosova, & Hua, 2023; Gao et al., 2023; Guo et al., 2023;
Khalil & Er, 2023; Krishna, Song, Karpinska, Wieting, & Iyyer, 2023; Liang, Yuksekgonul, Mao, Wu, & Zou, 2023;
Pegoraro, Kumari, Fereidooni, & Sadeghi, 2023; Perkins, Roe, Postma, McGaughran, & Hickerson, 2023; Wang,
Liu, Xie, & Li, 2023; Weber-Wulff et al., 2023; Yan, Fauss, Hao, & Cui, 2023). Tables 1 and 2 summarize the results
of the analyses most similar to this investigation. That is, the tables exclude evaluations of detectors not
currently available to the public (e.g., Desaire et al., 2023; Guo et al., 2023; Yan et al., 2023), studies of texts
created by nonnative writers of English (Liang et al., 2023), evaluations of computer code and related materials
(Wang et al., 2023), analyses in which the AI-generated papers were modified before being submitted to the detectors (Anderson et al., 2023; Krishna et al., 2023; Sadasivan, Kumar, Balasubramanian, Wang, & Feizi, 2023; Weber-Wulff et al., 2023), and reports in which the detectors were not identified by name (Dalalah & Dalalah, 2023).
Table 1: Percentage of ChatGPT texts correctly identified as AI in previous studies (a)

Detector | Aremu, 2023 | Cingillioglu, 2023 | Desaire et al., 2023 | Gao et al., 2023 | Guo et al., 2023 | Khalil & Er, 2023 | Krishna et al., 2023 (b) | Krishna et al., 2023 (c) | Liang et al., 2023 (d) | Liang et al., 2023 (e) | Pegoraro et al., 2023 | Perkins et al., 2023 | Wang et al., 2023 (c) | Wang et al., 2023 (b) | Weber-Wulff et al., 2023 | Weber-Wulff et al., 2023 (f) | Yan et al., 2023
No. of documents | 4 | 75 | 120 | 50 | 27k | 50 | — | — | 31 | 145 | 7k | 22 | 15k | 25k | 18 | 18 | 800
ChatGPT version | — | — | 3.5 | 3 | 3.5 | — | 3.5 | 3.5 | 3.5 | 3.5 | — | 4 | 3.5 | 3.5 | 3.5 | 3.5 | 3
ChatGPT | — | — | — | — | — | 92 | — | — | — | — | — | — | — | — | — | — | —
Checker AI | — | — | — | — | — | — | — | — | — | — | 13 | — | — | — | — | — | —
Compilatio | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 89 | 92 | —
Content at Scale | Low | — | — | — | — | — | — | — | — | — | 38 | — | — | — | 0 | 0 | —
Copyleaks | — | 97 | — | — | — | — | — | — | — | — | 23 | — | — | — | — | — | —
Crossplag | Low | — | — | — | — | — | — | — | 58 | 37 | — | — | — | — | 89 | 89 | —
DetectGPT | — | — | — | — | — | — | 27 | 67 | — | — | 18 | — | 66 | 63 | 56 | 75 | —
Draft and Goal | — | — | — | — | — | — | — | — | — | — | 24 | — | — | — | — | — | —
GLTR | — | — | — | — | High | — | — | — | — | — | 32 | — | — | — | — | — | —
GPT-2/RoBERTa | — | — | 92 | High | High | — | — | — | — | — | 7 | — | 60 | 79 | 94 | 94 | 100
GPTZero | High | 96 | — | — | — | — | 7 | — | 100 | 14 | 27 | — | 44 | 17 | 78 | 86 | —
Grover | — | — | — | — | — | — | — | — | — | — | 43 | — | — | — | — | — | —
Hello-SimpleAI | — | — | — | — | — | — | — | — | — | — | 47 | — | — | — | — | — | —
Hugging Face | — | — | — | — | — | — | — | — | — | — | 11 | — | — | — | — | — | —
OpenAI | Low | 96 | — | — | — | — | 30 | 41 | 58 | 41 | 32 | — | 99 | 74 | 50 | 61 | —
Originality.ai | — | — | — | — | — | — | — | — | 42 | 59 | 8 | — | — | — | — | — | —
Perplexity | — | — | — | — | — | — | — | — | — | — | 44 | — | — | — | — | — | —
PlagiarismCheck | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 33 | 47 | —
Quill.org | — | — | — | — | — | — | — | — | 58 | 57 | — | — | — | — | — | — | —
RankGen | — | — | — | — | — | — | 1 | — | — | — | — | — | — | — | — | — | —
RoBERTa-QA | — | — | — | — | High | — | — | — | — | — | — | — | 68 | 67 | — | — | —
Sapling | Low | — | — | — | — | — | — | — | 74 | 68 | — | — | — | — | — | — | —
TurnItIn | — | — | — | — | — | — | — | — | — | — | — | 91 | — | — | 94 | 97 | —
Winston AI | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 94 | 94 | —
Writefull | — | — | — | — | — | — | — | — | — | — | 22 | — | — | — | 28 | 53 | —
Writer | — | — | — | — | — | — | — | — | — | — | 7 | — | 23 | 17 | 44 | 53 | —
ZeroGPT | High | — | — | — | — | — | — | — | 100 | 31 | 46 | — | — | — | 83 | 83 | —

(a) Includes only those analyses that evaluated unmodified ChatGPT output. (b) Wikipedia-type articles. (c) Responses to short questions. (d) College admissions essays. (e) Abstracts of scientific papers. (f) Half credit was assigned for responses that were neither clearly correct nor clearly incorrect.
Table 2: Percentage of human-generated texts correctly identified as human in previous studies

Detector | Aremu, 2023 | Cingillioglu, 2023 | Desaire et al., 2023 | Gao et al., 2023 | Guo et al., 2023 | Liang et al., 2023 (a) | Pegoraro et al., 2023 | Wang et al., 2023 (b) | Wang et al., 2023 (c) | Weber-Wulff et al., 2023 | Weber-Wulff et al., 2023 (d) | Yan et al., 2023
No. of documents | 24 | 75 | 60 | 50 | 59k | 88 | 6k | 15k | 25k | 9 | 9 | 800
Checker AI | — | — | — | — | — | — | 95 | — | — | — | — | —
Compilatio | — | — | — | — | — | — | — | — | — | 89 | 94 | —
Content at Scale | 100 | — | — | — | — | — | 80 | — | — | 100 | 100 | —
Copyleaks | — | 93 | — | — | — | — | 92 | — | — | — | — | —
Crossplag | 100 | — | — | — | — | 88 | — | — | — | 100 | 100 | —
DetectGPT | — | — | — | — | — | — | 80 | 94 | 65 | 100 | 100 | —
Draft and Goal | — | — | — | — | — | — | 91 | — | — | — | — | —
GLTR | — | — | — | — | High | — | 98 | — | — | — | — | —
GPT-2/RoBERTa | — | — | 97 | High | High | — | 96 | 6 | 11 | 100 | 100 | 100
GPTZero | High | 96 | — | — | — | 100 | 94 | 98 | 97 | 67 | 67 | —
Grover | — | — | — | — | — | — | 91 | — | — | — | — | —
Hello-SimpleAI | — | — | — | — | — | — | 98 | — | — | — | — | —
Hugging Face | — | — | — | — | — | — | 63 | — | — | — | — | —
OpenAI | High | 97 | — | — | — | 91 | 92 | 37 | 39 | 100 | 100 | —
Originality.ai | — | — | — | — | — | 99 | 95 | — | — | — | — | —
Perplexity | — | — | — | — | — | — | 98 | — | — | — | — | —
PlagiarismCheck | — | — | — | — | — | — | — | — | — | 78 | 89 | —
Quill.org | — | — | — | — | — | 91 | — | — | — | — | — | —
RoBERTa-QA | — | — | — | — | High | — | — | 95 | 65 | — | — | —
Sapling | High | — | — | — | — | 95 | — | — | — | — | — | —
TurnItIn | — | — | — | — | — | — | — | — | — | 100 | 100 | —
Winston AI | — | — | — | — | — | — | — | — | — | 78 | 83 | —
Writefull | — | — | — | — | — | — | 99 | — | — | 100 | 100 | —
Writer | — | — | — | — | — | — | 95 | 96 | 93 | 100 | 100 | —
ZeroGPT | High | — | — | — | — | 100 | 92 | — | — | 100 | 100 | —

(a) Essays by middle school students. (b) Responses to short questions. (c) Wikipedia-type articles. (d) Half credit was assigned for responses that were neither clearly correct nor clearly incorrect.

Together, Tables 1 and 2 suggest that GPT-2/RoBERTa, TurnItIn, and ZeroGPT are the most consistently
accurate detectors. Overall, however, the results for the 27 detectors are not consistent across the 29 analyses.
There are at least three reasons for this. First, three different versions of ChatGPT were used to generate the AI
documents. Most of the investigations used GPT-3.5, but at least two used GPT-3 and at least one used GPT-4.
Second, the documents themselves are of various types. Seventeen analyses evaluated undergraduate essays
or responses to short, straightforward questions, but the others used a variety of texts including abstracts of
scientific papers (Gao et al., 2023; Liang et al., 2023), college admissions essays (Liang et al., 2023), essays by
middle school students (Liang et al., 2023), examination papers (Yan et al., 2023), overview articles in scientific
journals (Desaire et al., 2023), and Wikipedia-type articles (Krishna et al., 2023; Wang et al., 2023). Finally, each
research team interpreted the detector output differently, adopting either rigorous or lenient standards for the
identification of AI- and human-generated text. This at least partly explains why some detectors performed
well in certain studies but not nearly as well in others.

2 Methods
This study evaluates the accuracy of 16 publicly available AI text detectors using three sets of documents: 42
undergraduate essays generated by ChatGPT-3.5, 42 generated by ChatGPT-4, and 42 written without the use of
AI by students in a first-year composition course. Each detector’s performance was assessed with regard to its
overall accuracy across all 126 documents, its accuracy when tested against each of the three sets of documents, its decisiveness (the relative number of uncertain responses), the number of false positives (human-generated papers designated as AI by the detector), and the number of false negatives (AI-generated papers
designated as human by the detector). The analysis involved four steps:
1. Prepare the three sets of documents.
2. Select the 16 AI text detectors to include in the study.
3. Use each detector to evaluate each of the 126 documents, coding the responses as AI, human, or uncertain.
4. Evaluate the accuracy of each detector – its effectiveness in identifying AI-generated and human-generated text (a concrete sketch of this step follows the list).
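As a concrete rendering of step 4, the following sketch computes the summary measures used in Section 3 from coded (true label, response) pairs. The sample data are invented for illustration.

```python
# Sketch of the accuracy measures in step 4, computed from
# (true_label, coded_response) pairs. The sample data are invented.
results = [("AI", "AI"), ("AI", "uncertain"), ("AI", "human"),
           ("human", "human"), ("human", "AI"), ("human", "human")]

n = len(results)
correct = sum(truth == coded for truth, coded in results)
uncertain = sum(coded == "uncertain" for _, coded in results)
false_pos = sum(t == "human" and c == "AI" for t, c in results)  # human flagged as AI
false_neg = sum(t == "AI" and c == "human" for t, c in results)  # AI passed as human

print(f"percentage correct:   {correct / n:.0%}")
print(f"percentage uncertain: {uncertain / n:.0%}")  # decisiveness = 100% minus this
print(f"false positives: {false_pos}, false negatives: {false_neg}")
```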

2.1 Preparing the 126 Documents

GPT-3.5 and GPT-4 were each used to generate 42 short papers (literature reviews) of the kind typically
expected of students in first-year composition courses at US universities. The 42 paper topics cover the social
sciences, the natural sciences, and the humanities (Appendix 1). A new chat/conversation was initiated for each
paper topic, and each topic was embedded within a ChatGPT prompt of the type recommended by Atlas (2023).
The same introductory text was used in each case: “I want you to act as an academic researcher. Your task is to
write a paper of approximately 2000 words with parenthetical citations and a bibliography that includes at
least 5 scholarly resources such as journal articles and scholarly books. The paper should respond to this
question: ‘[paper topic].’” Because the ChatGPT response field is limited in length, the system’s initial response
to each prompt was never a complete paper. An additional prompt of “Please continue” was used, sometimes
more than once, to get ChatGPT to continue the text exactly where it had left off.1 All the AI texts were
generated in the first week of April 2023.
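The papers for this study were generated interactively in ChatGPT, but readers who want to reproduce a similar workflow programmatically could do so with the OpenAI Python client, as in the sketch below. The model name, turn limit, and stopping rule are illustrative assumptions rather than details of the study's procedure.

```python
# Hypothetical automation of the generation procedure described above,
# using the OpenAI Python client (openai>=1.0). The study itself used
# the ChatGPT interface; the model name and stopping rule are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("I want you to act as an academic researcher. Your task is to write "
          "a paper of approximately 2000 words with parenthetical citations and "
          "a bibliography that includes at least 5 scholarly resources such as "
          "journal articles and scholarly books. The paper should respond to "
          "this question: '{topic}'")

def generate_paper(topic: str, model: str = "gpt-4", max_turns: int = 6) -> str:
    messages = [{"role": "user", "content": PROMPT.format(topic=topic)}]
    paper = ""
    for _ in range(max_turns):
        reply = client.chat.completions.create(model=model, messages=messages)
        chunk = reply.choices[0].message.content
        paper += chunk
        if reply.choices[0].finish_reason == "stop":  # response was not truncated
            break
        # Mirror the study's procedure: ask the model to pick up where it left off.
        messages += [{"role": "assistant", "content": chunk},
                     {"role": "user", "content": "Please continue"}]
    return paper
```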


1 If “Please continue” was entered near the end of the paper, ChatGPT sometimes provided supplementary text that was not fully
integrated into the main body of the paper, presumably on the assumption that the original response was unsatisfactory or
inadequate. For this study, any text that followed the bibliography was not regarded as part of the paper generated by ChatGPT
and was therefore excluded from the analysis.

The 42 human-generated documents were taken from a set of 178 papers submitted by Manhattan College
English 110 (First Year Composition) students during the 2014–2015 academic year. The use of papers from 2014
to 2015, before the widespread availability of AI tools such as ChatGPT, ensures that these papers were created
without the use of AI. Although the English 110 papers do not cover the exact same topics as the AI-generated
papers, they are quite similar; they cover topics such as gun control, racism in the US education system, policy
responses to climate change, robotic warfare, family structure in traditional folk tales, e-cigarettes and public
health, the ethical implications of the death penalty, concussion in the National Hockey League, and 3D
printing technology. Stratified random sampling was used to select a set of papers with the same broad subject
representation as the ChatGPT documents: 25 papers in the social sciences, 9 in the natural sciences, and 8 in
the humanities.
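A stratified selection of this kind is easy to reproduce. The sketch below draws a 25/9/8 sample from a pool of papers tagged by broad subject; the pool and the random seed are invented stand-ins for the 178 English 110 papers.

```python
# Sketch of the stratified random sampling used to match the subject mix
# of the ChatGPT papers (25 social sciences, 9 natural sciences,
# 8 humanities). The pool below is an invented stand-in.
import random

rng = random.Random(42)  # fixed seed for reproducibility (illustrative)
pool = ([("social", f"soc_{i}") for i in range(100)]
        + [("natural", f"nat_{i}") for i in range(40)]
        + [("humanities", f"hum_{i}") for i in range(38)])

quotas = {"social": 25, "natural": 9, "humanities": 8}
sample = []
for subject, quota in quotas.items():
    stratum = [paper for field, paper in pool if field == subject]
    sample += rng.sample(stratum, quota)

print(len(sample))  # 42
```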

2.2 Selecting the 16 AI Text Detectors

Although dozens of AI text detectors are available online, just 10 appear on two or more of five recent “best AI
text detector” lists (Abdullahi, 2023; Caulfield, 2023; Ganesh, 2023; Somoye, 2023; Wiggers, 2023): Content at
Scale (2023), Copyleaks (2023), Crossplag (2023), GPT Radar (2023), GPTZero (2023), OpenAI (2023b),2 Originality.ai (2023), Sapling (2023), Writer (2023), and ZeroGPT (2023). This study evaluates those 10 AI text detectors,
along with TurnItIn and 5 others (Table 3).
TurnItIn (2023) was added to the study due to its widespread availability at colleges and universities in the
United States and elsewhere. Instructors at institutions with subscriptions to the TurnItIn plagiarism detector
also have access to the AI text detector, unless their universities have chosen not to make it available.3
The five other AI text detectors included in the study – ContentDetector.ai (2023), Grammica (2023),
IvyPanda (2023), Scribbr (2023), and SEO.ai (2023) – are promoted widely online, do not require registration
or payment, and do not appear on any of the five “best detector” lists. Arguably, these detectors are typical of
the tools students might use to conduct a quick check of their papers for evidence of AI involvement. A Google
search for free AI text detector was conducted, and the first five detectors that met the criteria (and that
worked reliably for the set of 126 documents) were included in the study. Some of them are clearly intended for
students who want to use AI without getting caught, and the IvyPanda site includes advertisements for a
paper-writing service (“Our experts can complete a task on any subject based on your instructions – without
any AI! To ensure that your paper is 100% human-written and plagiarism-free, place an order here.”).

2.3 Evaluating the Documents and Coding the Responses

Each of the 126 documents was stripped of any introductory material (e.g., course and author information),
tables, figures, and lists of works cited, then entered into each of the 16 AI text detectors in plain-text format.
Documents longer than the maximum allowable length (Table 3) were truncated. The detector tests were
conducted from June 25 through July 12, 2023.
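As a minimal sketch of this preparation step, the function below truncates a plain-text document to a detector's character limit. The limits shown are examples taken from Table 3; the function itself is invented for illustration.

```python
# Sketch of document preparation: plain text is truncated to a detector's
# maximum input size before submission. Example limits from Table 3.
MAX_CHARS = {"Content at Scale": 25_000, "SEO.ai": 5_000, "Writer": 1_500}

def prepare(text: str, detector: str) -> str:
    """Return the body text, truncated to the detector's character limit."""
    return text[:MAX_CHARS[detector]]

sample = "lorem ipsum " * 1_000  # 12,000-character stand-in document
print(len(prepare(sample, "Writer")))  # 1500
```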


2 The OpenAI text detector was discontinued on July 20, 2023, due to low accuracy. As of September 2023, it is no longer available
online.
3 As one reviewer pointed out, the results for TurnItIn may be biased if (a) some students’ papers were submitted to TurnItIn for
plagiarism checking and (b) those papers were subsequently used to train the TurnItIn AI text detector. In that case, the TurnItIn
detector might be expected to perform especially well with this particular set of human-generated papers. This is not likely to be a
major problem, however, since very few Manhattan College instructors used TurnItIn in 2014–2015. A related issue is that students
may have submitted their papers to the iThenticate plagiarism checker, which is associated with TurnItIn and uses the same set of
texts. Unfortunately, we do not know the extent to which this might have occurred.
Table 3: Characteristics of the 16 AI text detectors

Detector | Payment | Limits on use | Input | Min. length | Max. length | Longer docs.
Content at Scale | Not required | None | Text box | 4 wds. | 25,000 chars. | Truncates
ContentDetector.ai | Not required | None | Text box | 2 wds. | ∼15,000 wds. | Will not process
Copyleaks (a) | Free: up to 45,000 wds. per day; thereafter: when billed monthly, $0.28 to $0.44 per thousand wds. | Without registration: 6,250 wds. per day; with registration but without payment: 45,000 wds. per day; with registration and payment: depends on amount paid | Free: text box; subscribers: text box or upload | 150 chars. | Free: 25,000 chars.; subscribers: 500,000 wds. | Will not process
Crossplag | Free, but registration is required for full functionality | None | Text box | 2 wds. | 3,000 wds. | Truncates
Grammica | Not required | None | Text box | 2 wds. | ∼380 wds. | Truncates
GPT Radar | Free: up to ∼2,500 wds. per day; thereafter: ∼$0.02 per 125 wds. | Depends on amount paid | Text box | ∼75 wds. | ∼1,400 wds. (lower than the stated limit) | Will not process
GPTZero (b) | Classic: not required; Educator (more effective): $9.99 per month; Pro (most effective): $19.99 per month | Classic: limits not stated; Educator: 1 million wds. per month; Pro: 2 million wds. per month | Text box or upload | 250 chars. | Classic: 5,000 chars.; Educator: 50,000 chars.; Pro: 50,000 chars. | Text box: will not process; upload: truncates
IvyPanda | Free, but registration is required | None | Text box | 2 wds. | 4,500 chars. | Truncates
OpenAI | Free, but registration is required | None | Text box | 1,000 chars. | ∼3,000 wds. | Will not process
Originality.ai (c) | $0.01 per 100 wds. | Depends on amount paid | Text box | 50 wds. | 10,000 wds. | Will not process
Sapling | Free version has limited functionality; subscription: $25 per month, but the system may offer a free 1-month trial | None | Text box | ∼150 chars. | Free: ∼2,000 chars.; paid: ∼8,000 chars. | Truncates
Scribbr | Not required | None | Text box | 25 wds. | 500 wds. | Will not process
SEO.ai | Not required | None | Text box | 2 wds. | 5,000 chars. | Truncates
TurnItIn | Institutional subscription required | None | Upload | 20 wds. | 800 pages | Will not process
Writer | Not required | None | Text box | 2 wds. | 1,500 chars. | Will not process
ZeroGPT | Not required | None | Text box or upload | 2 wds. | 50,000 chars. | Will not process

(a) Free interface: https://copyleaks.com/ai-content-detector; subscriber interface: https://app.copyleaks.com/dashboard/v1/account/new-scan. (b) This study presents the Pro results; the Educator results are identical except that one GPT-3.5 paper classified as uncertain by Educator is classified as AI by Pro. (c) This study uses detection model 1.4 rather than 1.1.

As Appendix 2 reveals, each detector’s output is unique. The responses used by the detectors to characterize
the documents vary in five important respects:
1. whether they include descriptive text, numeric values, or both
2. whether the wording of the text is formal or casual
3. whether the assessments suggest a high degree of confidence (“this text is AI generated”) or greater
ambiguity (“parts of the text may show evidence of AI involvement”)
4. whether the numeric scores represent the proportion of the text that is AI generated, the detector’s level of
confidence in the result, or something else
5. whether there are just a few possible responses or many.

Each of the 2,016 responses was coded as AI generated, human generated, or uncertain. (AI generated
indicates that a significant portion of the text – not necessarily all of it – is likely to be AI generated.) For
responses that included both descriptive text and a numeric component, the descriptive text (e.g., “likely AI
generated”) was regarded as definitive. For the strictly numeric results provided by Grammica, Originality.ai,
Sapling, Scribbr, and TurnItIn, each response was categorized as AI, human, or uncertain based on three
factors: the meaning of the numeric value, the natural breaks in the frequency distribution, and the general
principle that roughly twice as many responses should be included in the AI category as in the human category.
Although just one individual coded the responses, the distinctions among the AI, uncertain, and human
categories were generally quite clear. (Appendix 2 shows the responses generated by the AI text detectors and
the number of times each response was given.) The only difficulty occurred with Sapling, for which the breaks
in the frequency distribution were not always pronounced. Overall, the classifications used here are very
similar to those adopted by Weber-Wulff et al. (2023).
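To illustrate the kind of decision rule involved, the sketch below codes hypothetical numeric scores (on a 0 to 1 scale, higher meaning more likely AI) into the three categories. The cutoffs are invented placeholders; the actual cutoffs were chosen per detector from the natural breaks in each score distribution.

```python
# Hypothetical coding of strictly numeric detector scores. The cutoffs
# below are illustrative placeholders, not the study's per-detector
# values, which were drawn from natural breaks in each distribution.
def code_response(score: float, ai_cutoff: float = 0.65,
                  human_cutoff: float = 0.35) -> str:
    if score >= ai_cutoff:
        return "AI"
    if score <= human_cutoff:
        return "human"
    return "uncertain"

scores = [0.97, 0.50, 0.12, 0.71, 0.33]
print([code_response(s) for s in scores])
# ['AI', 'uncertain', 'human', 'AI', 'human']
```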

3 Results and Discussion

3.1 Accuracy of the 16 AI Text Detectors

Two of the 16 detectors, Copyleaks and TurnItIn, correctly identified the AI- or human-generated status of all
126 documents, with no incorrect or uncertain responses. As noted in Section 2.2, however, it is possible that
TurnItIn performs especially well with the human-generated papers used in this particular analysis. A third
detector, Originality.ai, performed nearly as well, correctly assessing the status of all but two documents –
human-generated papers that it could not classify with certainty (Table 4 and Figure 1).
Among the other 13 detectors, overall accuracy ranges from 63 to 88%. The distribution of percentage
correct follows a smooth progression, with just three distinct groups: the top 3 detectors, the next 11, and the
bottom 2 – Sapling and ContentDetector.ai.
All the detectors except Content at Scale and ContentDetector.ai are able to identify the GPT-3.5 documents
as AI generated at least 86% of the time, and seven perform flawlessly with this particular set of documents
(Figure 2). Likewise, all but three – ZeroGPT, SEO.ai, and Sapling – are effective at identifying human-generated text (Figure 3). However, only the top three detectors can correctly classify GPT-4 documents with greater
than 83% accuracy; the rest tend to classify those documents as human or uncertain (Figure 4). Arguably, this is
the most important distinction between the top 3 detectors and the other 13.

3.2 Correlates of Accuracy

As noted in Section 2.2, 10 of the 16 detectors were initially identified through online “best detector” lists.
Overall, the detectors that appear on these lists are only marginally more accurate than the others – 81% correct versus 77%.
Table 4: Percentage of documents for which each detector gave correct or incorrect responses (a)

Detector | All: % correct | All: % incorrect | All: % uncertain | AI: % correct | AI: % incorrect | GPT-3.5: % correct | GPT-3.5: % incorrect | GPT-4: % correct | GPT-4: % incorrect | Human: % correct | Human: % incorrect
Copyleaks (b) | 100 | 0 | 0 | 100 | 0 | 100 | 0 | 100 | 0 | 100 | 0
TurnItIn | 100 | 0 | 0 | 100 | 0 | 100 | 0 | 100 | 0 | 100 | 0
Originality.ai (b) | 98 | 0 | 2 | 100 | 0 | 100 | 0 | 100 | 0 | 95 | 0
Scribbr | 88 | 11 | 1 | 85 | 15 | 100 | 0 | 69 | 31 | 95 | 2
ZeroGPT (b) | 87 | 1 | 12 | 92 | 0 | 100 | 0 | 83 | 0 | 79 | 2
Grammica | 86 | 11 | 3 | 81 | 17 | 100 | 0 | 62 | 33 | 95 | 0
GPTZero (b) | 81 | 4 | 15 | 77 | 5 | 98 | 0 | 57 | 10 | 88 | 2
Crossplag (b) | 80 | 20 | 0 | 77 | 23 | 86 | 14 | 69 | 31 | 86 | 14
OpenAI (b) | 78 | 6 | 17 | 69 | 8 | 98 | 2 | 40 | 14 | 95 | 0
IvyPanda | 77 | 0 | 23 | 71 | 0 | 100 | 0 | 43 | 0 | 88 | 0
GPT Radar (b) | 76 | 24 | 0 | 64 | 36 | 98 | 2 | 31 | 69 | 100 | 0
SEO.ai | 72 | 4 | 24 | 92 | 0 | 100 | 0 | 83 | 0 | 33 | 12
Content at Scale (b) | 71 | 13 | 15 | 63 | 15 | 74 | 2 | 52 | 29 | 88 | 10
Writer (b) | 71 | 29 | 0 | 64 | 36 | 88 | 12 | 40 | 60 | 86 | 14
Sapling (b) | 65 | 7 | 28 | 63 | 11 | 93 | 0 | 33 | 21 | 69 | 0
ContentDetector.ai | 63 | 10 | 27 | 45 | 14 | 83 | 0 | 7 | 29 | 100 | 0
Avg. percentage | 81 | 9 | 10 | 78 | 11 | 95 | 2 | 61 | 20 | 87 | 4
Standard deviation | 12 | 9 | 11 | 16 | 12 | 8 | 4 | 28 | 22 | 17 | 5
Median percentage | 79 | 7 | 8 | 77 | 10 | 99 | 0 | 60 | 18 | 92 | 0

(a) In each case, the percentage uncertain is the percentage neither correct nor incorrect. (b) Appears on at least two of the “best AI text detector” websites.

Figure 1: Percentage of all 126 documents for which each detector gave correct, uncertain, or incorrect responses.

For the set of all detectors other than TurnItIn, there is no meaningful correlation between
the accuracy of a detector and its appearance on the “best detector” lists; Kendall’s tau-b = 0.08.
In general, the accuracy of a detector is only modestly associated with its paid or free status. While all
three of the most accurate detectors require registration and payment for full functionality, the three others
that require payment – GPTZero, GPT Radar, and Sapling – have just average or below-average accuracy.
Among the six detectors that require a subscription, the average accuracy is 87%; among the others, it is 77%.
Overall, the correlation between the accuracy of a detector and its paid or free status is weak; Kendall’s tau-b
= 0.29.
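Kendall’s tau-b is a rank correlation that adjusts for ties, which matters here because list appearance and paid status are binary indicators. A minimal sketch with SciPy, using invented values rather than the study’s data:

```python
# Sketch of a Kendall's tau-b computation with scipy. The accuracy
# values and paid/free flags below are invented, not the study's data.
from scipy.stats import kendalltau

accuracy = [100, 98, 88, 87, 86, 81, 80, 78, 77, 76, 72, 71, 71, 65, 63]
paid     = [  1,  1,  0,  0,  0,  1,  0,  0,  0,  1,  0,  0,  0,  1,  0]

tau, p_value = kendalltau(accuracy, paid)  # tau-b variant adjusts for ties
print(f"tau-b = {tau:.2f} (p = {p_value:.2f})")
```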

3.3 Key Similarities and Differences Among the 16 AI Text Detectors

Table 5 highlights the characteristics that set each detector apart from the others. Copyleaks, TurnItIn, and
Originality.ai are similar in many respects. Likewise, ZeroGPT and GPTZero are much the same, as are Sapling
and ContentDetector.ai.
The three accuracy columns in Table 5 are based not just on percentage correct, but on percentage
incorrect and the ratio of correct to incorrect responses. For example, GPTZero has a high accuracy designation
while Crossplag does not – but this cannot be attributed to the one-point difference in their accuracy rates.
Instead, it reflects the fact that GPTZero has a lower rate of incorrect responses. When the type of document is
unclear, GPTZero generally gives a response of uncertain. In contrast, Crossplag is more likely to label AI text
as human and vice versa.

Figure 2: Percentage of the 42 GPT-3.5 documents for which each detector gave correct, uncertain, or incorrect responses.

As described in Section 3.1, many detectors are effective at identifying GPT-3.5 text but ineffective at
identifying GPT-4 text. This same result can be seen when percentage incorrect is taken into account. In
particular, four detectors have excellent performance with regard to GPT-3.5 but very poor performance
with regard to GPT-4. GPT Radar is perhaps the best example of this, with correct responses for 98% of the
GPT-3.5 documents but for just 31% of the GPT-4 documents – worse than might be expected due to chance
alone.
The decisiveness column represents the percentage of documents for which each detector gave responses
of AI or human rather than uncertain. The high decisiveness label was assigned to detectors with uncertainty
rates lower than 4% and the low label to those with uncertainty rates higher than 22%.
The false positives column identifies the detectors that are especially likely to respond AI when evaluating
papers written by humans. The four detectors labeled many each have false positive rates of 10–14%. In
contrast, the other detectors each have no more than a single false positive within the set of 42 human-
generated documents.
Likewise, the false negatives column identifies the detectors that are especially likely to respond human for
papers that were actually produced by an AI. Crossplag, GPT Radar, and Writer each have false negative rates
of 23–36%, while the other detectors have a maximum rate of 17% and a mean of 6.5%.
The many false positives for SEO.ai and Content at Scale reflect their general tendency to declare that text
is AI rather than human. Likewise, the many false negatives for GPT Radar reflect its tendency to label text as
human rather than AI. The situation is different for Crossplag and Writer, however. Those two detectors have
many false positives and many false negatives due to a combination of relative inaccuracy and high decisiveness. Overall, the more accurate detectors tend to be more decisive – the correlation between percentage
correct and percentage uncertain is −0.68 – but Crossplag and Writer are exceptions to that general
relationship.

Figure 3: Percentage of the 42 human-generated documents for which each detector gave correct, uncertain, or incorrect responses.

4 Conclusion

4.1 Main Findings

The results of this study support three main conclusions:


1. Three AI text detectors – Copyleaks, TurnItIn, and Originality.ai – have very high accuracy with all three sets
of documents examined for this study: GPT-3.5 papers, GPT-4 papers, and human-generated papers.
2. Most of the other detectors can distinguish between GPT-3.5 papers and human-generated papers with
reasonably high accuracy. However, most are ineffective at distinguishing between GPT-4 papers and
papers written by students.
3. In general, a detector’s free or paid status is not a good indicator of its accuracy, nor is its appearance on the
“best AI text detector” lists considered here.

Several recent articles in the popular press have asserted that AI-generated text is almost impossible to
identify (Heikkilä, 2023; Maruccia, 2023; Mujezinovic, 2023; Wiggers, 2023; Williams, 2023), and it is true that
most detectors perform poorly with GPT-4 documents. However, these results also suggest that technological
improvements in publicly available AI text generators are matched very quickly by improvements in the
capabilities of the best AI text detectors. The release of GPT-4 in March 2023 may have given AI users a
temporary ability to pass off AI text as human-authored – but less than 4 months later, the three most effective
AI text detectors perform just as well with GPT-4 documents as with GPT-3.5 documents.

Figure 4: Percentage of the 42 GPT-4 documents for which each detector gave correct, uncertain, or incorrect responses.

Table 5: Effectiveness of the 16 AI text detectors

Detector | Overall accuracy | Accuracy, GPT-3.5 | Accuracy, GPT-4 | Decisiveness | False positives | False negatives
Copyleaks (a) | V. high | V. high | V. high | High | — | —
TurnItIn | V. high | V. high | V. high | High | — | —
Originality.ai (a) | V. high | V. high | V. high | High | — | —
Scribbr | High | V. high | — | High | — | —
ZeroGPT (a) | High | V. high | — | — | — | —
Grammica | High | V. high | Low | High | — | —
GPTZero (a) | High | V. high | — | — | — | —
Crossplag (a) | — | — | — | High | Many | Many
OpenAI (a) | — | V. high | Low | — | — | —
IvyPanda | — | V. high | Low | Low | — | —
GPT Radar (a) | — | V. high | Low | High | — | Many
SEO.ai | — | V. high | — | Low | Many | —
Content at Scale (a) | — | — | Low | — | Many | —
Writer (a) | — | — | Low | High | Many | Many
Sapling (a) | Low | — | Low | Low | — | —
ContentDetector.ai | Low | — | Low | Low | — | —

(a) Appears on at least two of the “best AI text detector” websites.

4.2 Previous Research and New Results

Previous research suggests that TurnItIn, ZeroGPT, and GPT-2/RoBERTa are among the more accurate AI text
detectors (Tables 1 and 2). These results support those earlier findings with regard to TurnItIn and ZeroGPT.
Of the top three detectors identified in this investigation, TurnItIn achieved very high accuracy in all five
previous evaluations. Copyleaks, included in four earlier analyses, performed very well in three of them. The
prior results for Originality.ai are mixed, suggesting that it classifies human-generated documents accurately
but has difficulty with AI-generated text. In this analysis, no such difficulty can be seen (Tables 4 and 5). As
noted in Section 1.2, previous studies have used a wide range of methods that do not always generate
comparable results. Consequently, comparative analyses such as this are especially important.

4.3 Implications

Many authors have called for the modification of traditional undergraduate essays and written assignments in
ways that circumvent the capabilities of generative AI (e.g., Baidoo-Anu & Owusu Ansah, 2023; Golinkoff &
Wilson, 2023; Marche, 2022; Rigolino, 2023; Tate, 2023). At the most superficial level, this involves changes in
assessment methods – a greater reliance on in-class exams and interactive presentations, for instance. At a
deeper level, it involves a greater emphasis on the kinds of capabilities that are unique to humans, such as the
generation and refinement of ideas rather than texts. AI tools can also be incorporated into teaching, helping
students learn how to edit, how to evaluate subtle differences in style and content, how to determine whether
an assertion is supported by evidence, and how to use AI effectively. Even in circumstances where the use of AI
is accepted or required, however, there is still a need to determine the extent of AI involvement.
When students are not expected to use AI, false positives can lead to unwarranted accusations of misconduct while false negatives may allow violations of academic integrity to go undetected. For this reason, the
detectors with high false positive or false negative rates (Table 5) should be avoided. If we also exclude the
detectors that are generally ineffective in detecting GPT-4 text, just a few detectors – essentially, the top three –
remain as viable candidates for use in the academic environment.
Local and individual factors are likely to influence the ways in which AI text detectors are used and
perceived. Some faculty may be inclined to accept their results uncritically, without further investigation or
consideration of the context. At the same time, other faculty may reject the use of detectors in favor of less
systematic, intuitive judgments. It is probably best to adopt a moderate approach – to consider the results
provided by AI text detectors, to account for other evidence as well, and to acknowledge that some detectors
are far more effective (or ineffective) than others. Assessments of students’ work should also consider the
specific parts of the text for which AI involvement was detected. Fortunately, 10 of the detectors evaluated here
– all but Crossplag, Grammica, OpenAI, Scribbr, SEO.ai, and Writer – provide separate assessments or scores
for particular phrases, sentences, or paragraphs within each document.

4.4 Limitations and Further Research

Because this investigation used student papers that could potentially have been used to train the TurnItIn
detector, TurnItIn may be especially accurate for the particular human-generated texts evaluated here. As
noted in Section 2.2, however, this is unlikely to have had a major impact on the results. More generally, this
analysis is based on a set of 126 undergraduate composition papers (literature reviews), so the results may not
be generalizable to other kinds of documents. The most significant limitation of the study, however, is that it
does not account for the fact that users of ChatGPT are likely to paraphrase or otherwise modify AI-generated
texts rather than simply submitting them, unaltered, as their own academic work (Welding, 2023). It is
important to know how well these detectors perform with unaltered ChatGPT text, but a more realistic
assessment would also evaluate their effectiveness in identifying documents that have been generated by AI,
then modified by users.
This is just the second study to evaluate the effectiveness of publicly available AI text detectors in
identifying documents generated by ChatGPT-4. (Perkins et al., 2023, was the first.) Additional analyses of
GPT-4 documents are needed. Moreover, this investigation and other recent studies suggest several questions
for further research:
1. How well do AI text detectors evaluate documents that are partly AI generated and partly human generated? Are the assessments provided by the detectors (e.g., “30% AI”) accurate, and does their accuracy vary
with the proportion of AI-generated text?
2. What paraphrasing strategies are most effective at thwarting AI text detectors? For instance, is it better to
replace words with less common synonyms, to change the order of clauses, or to introduce idiosyncratic
phrases? Several studies have shown that paraphrasing can alter AI-generated texts to make them less
susceptible to detection (Anderson et al., 2023; Krishna et al., 2023; Sadasivan et al., 2023; Weber-Wulff et al.,
2023), but none have evaluated the effectiveness of the various paraphrasing techniques.
3. How do students actually modify AI-generated or AI-assisted texts when completing their assignments? Are
those modifications effective at rendering AI involvement undetectable?

Finally, there is a need to investigate potential biases in the performance of AI text detectors. Liang et al.
(2023) have demonstrated that the texts written by nonnative speakers of English are far more likely than
those of native speakers to generate false positive responses. It would be helpful to know whether this bias is
widespread or whether it is restricted to particular types of authors or documents.

Acknowledgments: I am grateful for the comments of Esther Isabelle Wilder and two anonymous referees.

Funding information: No funding was involved.

Conflict of interest: The author states no conflict of interest.

Data availability statement: The texts generated by GPT-3.5 and GPT-4 in response to the 42 prompts are
available from the author on request, as are the results (responses) generated by the 16 AI text detectors for
each of the 126 documents.

References
Abdullahi, A. (2023, May 5). Top 10 AI detector tools for 2023. eWeek. https://www.eweek.com/artificial-intelligence/ai-detector-software/.
Allison, N. (2023, Mar. 16). 250+ interesting research paper topics for 2022. MyPerfectWords. https://myperfectwords.com/blog/research-
paper-guide/research-paper-topics.
Anderson, N., Belavy, D. L., Perle, S. M., Hendricks, S., Hespanhol, L., Verhagen, E., & Memon, A. R. (2023). AI did not write this manuscript,
or did it? Can we trick the AI text detector into generated texts? The potential future of ChatGPT and AI in sports & exercise medicine
manuscript generation. BMJ Open Sport & Exercise Medicine, 9(1), article e001568. doi: 10.1136/bmjsem-2023-001568.
Andrews, E. (2023). Comparing AI detection tools: One instructor’s experience. Academic Honesty and Integrity. https://tilt.colostate.edu/
comparing-ai-detection-tools-one-instructors-experience/.
Aremu, T. (2023, June 7). Unlocking Pandora’s box: Unveiling the elusive realm of AI text detection. Rochester, NY: SSRN. doi: 10.2139/ssrn.
4470719.
Atlas, S. (2023). Chatbot prompting: A guide for students, educators, and an AI-augmented workforce. https://www.researchgate.net/
publication/367464129_Chatbot_Prompting_A_guide_for_students_educators_and_an_AI-augmented_workforce.
Aw, B. (2023, July 23). 12 best AI detectors in 2023: Results from 180 tests. https://brendanaw.com/best-ai-detector.
Baidoo-Anu, D., & Owusu Ansah, L. (2023, Jan. 25). Education in the era of generative artificial intelligence (AI): Understanding the potential
benefits of ChatGPT in promoting teaching and learning. Rochester, NY: SSRN. doi: 10.2139/ssrn.4337484.
Caulfield, J. (2023, June 2). Best AI detector: Free & premium tools compared. Scribbr. https://www.scribbr.com/ai-tools/best-ai-detector/.

Cemper, C. C. (2023, Jan. 29). 13 AI content detection tools tested and AI watermarks. LinkResearchTools. https://www.linkresearchtools.
com/blog/ai-content-detector-tools/.
Cingillioglu, I. (2023). Detecting AI-generated essays: The ChatGPT challenge. International Journal of Information and Learning Technology,
40(3), 259–268. doi: 10.1108/IJILT-03-2023-0043.
Compilatio.net. (2023, Feb. 16). Comparison of the best AI detectors in 2023. https://www.compilatio.net/en/blog/best-ai-detectors.
Content at Scale. (2023). AI detector for ChatGPT, GPT4, bard & more. https://contentatscale.ai/ai-content-detector/.
ContentDetector.ai. (2023). AI content detector – ChatGPT plagiarism checker. https://contentdetector.ai/.
Copyleaks. (2023). AI content detector. https://copyleaks.com/ai-content-detector.
Crossplag. (2023). AI content detector. https://crossplag.com/ai-content-detector/.
Crothers, E. N., Japkowicz, N., & Viktor, H. L. (2023, July 18). Machine-generated text: A comprehensive survey of threat models and
detection methods. IEEE Access, 11, 70977–71002. doi: 10.1109/ACCESS.2023.3294090.
Dalalah, D., & Dalalah, O. M. A. (2023). The false positives and false negatives of generative AI detection tools in education and academic
research: The case of ChatGPT. International Journal of Management Education, 21(2), article 100822. doi: 10.1016/j.ijme.2023.100822.
Demers, T. (2023, Apr. 25). 16 of the best AI and ChatGPT content detectors compared. Search Engine Land. https://searchengineland.
com/ai-chatgpt-content-detectors-395957.
Desaire, H., Chua, A. E., Isom, M., Jarosova, R., & Hua, D. (2023). Distinguishing academic science writing from humans or ChatGPT with
over 99% accuracy using off-the-shelf machine learning tools. Cell Reports Physical Science, 4(6), article 101426. doi: 10.1016/j.xcrp.
2023.101426.
Deziel, M. (2023, Feb. 19). We pitted ChatGPT against tools for detecting AI-written text, and the results are troubling. The Conversation.
https://theconversation.com/we-pitted-chatgpt-against-tools-for-detecting-ai-written-text-and-the-results-are-troubling-199774.
Dweck, C. S. (1986). Motivational processes affecting learning. American Psychologist, 41(10), 1040–1048. doi: 10.1037/0003-066X.41.
10.1040.
Ganesh, S. (2023, June 12). Explore these top 5 AI detector tools to detect AI-generated content. Analytics Insight. https://www.
analyticsinsight.net/explore-these-top-5-ai-detector-tools-to-detect-ai-generated-content/.
Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S., Luo, Y., & Pearson, A. T. (2023). Comparing scientific abstracts generated
by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digital Medicine, 6, article 75. doi: 10.1038/s41746-023-
00819-6.
Gewirtz, D. (2023, Jan. 13). Can AI detectors save us from ChatGPT? I tried 3 online tools to find out. ZDNET Tech Today. https://www.zdnet.
com/article/can-ai-detectors-save-us-from-chatgpt-i-tried-3-online-tools-to-find-out/.
Gillham, J. (2023). AI content detector accuracy review + open source dataset and research tool. Originality.ai. https://originality.ai/blog/
ai-content-detection-accuracy.
Golinkoff, R. M., & Wilson, J. (2023, Feb. 2). ChatGPT is a wake-up call to revamp how we teach writing. Philadelphia Inquirer. https://www.
inquirer.com/opinion/commentary/chatgpt-ban-ai-education-writing-critical-thinking-20230202.html.
GPT Radar. (2023). Detect AI generated text in a click. https://gptradar.com/.
GPTZero. (2023). More than an AI detector. Preserve what’s human. https://gptzero.me/.
Grammica. (2023). AI detector. https://grammica.com/ai-detector.
Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., … Wu, Y. (2023, Jan. 18). How close is ChatGPT to human experts? Comparison corpus,
evaluation, and detection. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2301.07597.
Heikkilä, M. (2023, Feb. 7). Why detecting AI-generated text is so difficult (and what to do about it). MIT Technology Review. https://www.
technologyreview.com/2023/02/07/1067928/why-detecting-ai-generated-text-is-so-difficult-and-what-to-do-about-it/.
Ivanov, V. (2023, June 23). Which is the best AI content detector? https://trickmenot.ai/which-is-the-best-ai-content-detector/.
IvyPanda. (2023). GPT essay checker for students. https://ivypanda.com/gpt-essay-checker.
Kearney, V. (2022, Oct. 26). 100 technology topics for research papers. Owlcation. https://owlcation.com/academia/100-Technology-
Topics-for-Research-Paper.
Khalil, M., & Er, E. (2023, Feb. 8). Will ChatGPT get you caught? Rethinking of plagiarism detection. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2302.04335.
Krishna, K., Song, Y., Karpinska, M., Wieting, J., & Iyyer, M. (2023, Mar. 23). Paraphrasing evades detectors of AI-generated text, but retrieval is
an effective defense. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2303.13408.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023, July 10). GPT detectors are biased against non-native English writers. Ithaca, NY:
Cornell University. doi: 10.48550/arXiv.2304.02819.
Lund, B. D., Wang, T., Mannuru, N. R., Nie, B., Shimray, S., & Wang, Z. (2023). ChatGPT and a new academic reality: Artificial Intelligence-
written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information
Science and Technology, 74(5), 570–581. doi: 10.1002/asi.24750.
Marche, S. (2022, Dec. 6). The college essay is dead. Nobody is prepared for how AI will transform academia. The Atlantic. https://www.
theatlantic.com/technology/archive/2022/12/chatgpt-ai-writing-college-student-essays/672371/.
Maruccia, A. (2023, Mar. 22). Reliable detection of AI-generated text is impossible, a new study says. TechSpot. https://www.techspot.
com/news/98031-reliable-detection-ai-generated-text-impossible-new-study.html.
Mujezinovic, D. (2023, May 11). AI content detectors don’t work, and that’s a big problem. MUO: Make Use Of. https://www.makeuseof.
com/ai-content-detectors-dont-work/.
OpenAI. (2023a). GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. https://openai.com/gpt-4.

OpenAI. (2023b). New AI classifier for indicating AI-written text. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.


OpenAI. (2023c, Mar. 27). GPT-4 technical report. https://paperswithcode.com/paper/gpt-4-technical-report-1.
Originality.ai. (2023). Most accurate AI content checker & plagiarism checker for content marketers. https://originality.ai/.
Paperell.net. (2023). 200 best research paper topics for 2023 + examples. https://paperell.net/blog/best-research-paper-topics.
Pegoraro, A., Kumari, K., Fereidooni, H., & Sadeghi, A.-R. (2023, Apr. 5). To ChatGPT, or not to ChatGPT: That is the question! Ithaca, NY:
Cornell University. doi: 10.48550/arXiv.2304.01487.
Perkins, M., Roe, J., Postma, D., McGaughran, J., & Hickerson, D. (2023, May 29). Game of tones: Faculty detection of GPT-4 generated content
in university assessments. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2305.18081.
Rigolino, R. E. (2023, Jan. 31). With ChatGPT, we’re all editors now. Inside Higher Ed. https://www.insidehighered.com/views/2023/01/31/
chatgpt-we-must-teach-students-be-editors-opinion.
Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2023, June 28). Can AI-generated text be reliably detected? Ithaca, NY:
Cornell University. doi: 10.48550/arXiv.2303.11156.
Sapling. (2023). AI detector. https://sapling.ai/ai-content-detector.
Sarikas, C. (2020, Jan. 25). 113 great research paper topics. PrepScholar. https://blog.prepscholar.com/good-research-paper-topics.
Scribbr. (2023). Free AI detector. https://www.scribbr.com/ai-detector/.
SEO.ai. (2023). AI content detector. https://seo.ai/detector.
Singh, A. (2023, July 24). 12 best AI content detectors of 2023 (accurate data). DemandSage. https://www.demandsage.com/ai-content-
detectors/.
Somoye, F. L. (2023, June 12). ChatGPT detectors in 2023. PC Guide. https://www.pcguide.com/apps/chat-gpt-detectors/.
Tate, J. (2023, Feb. 5). Socrates never wrote a term paper. Wall Street Journal, 281, A15. https://www.wsj.com/articles/socrates-never-
wrote-a-term-paper-education-teaching-learning-college-ai-chatgpt-lecturing-students-11675613853.
TurnItIn. (2023). Empower students to do their best, original work. https://www.turnitin.com/.
van Oijen, V. (2023, Mar. 31). AI-generated text detectors: Do they work? SURF Communities: AI in Education. https://communities.surf.nl/
en/ai-in-education/article/ai-generated-text-detectors-do-they-work.
Walters, W. H., Sheehan, S. E., Handfield, A. E., López-Fitzsimmons, B. M., Markgren, S., & Paradise, L. (2020). A multi-method information
literacy assessment program: Foundation and early results. Portal: Libraries and the Academy, 20(1), 101–135. doi: 10.1353/pla.
2020.0006.
Wang, J., Liu, S., Xie, X., & Li, Y. (2023, Apr. 11). Evaluating AIGC detectors on code content. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.
2304.05193.
Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., Foltýnek, T., Guerrero-Dib, J., Popoola, O., … Waddington, L. (2023, July 10). Testing
of detection tools for AI-generated text. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2306.15666.
Welding, L. (2023, Mar. 27). Half of college students say using AI on schoolwork is cheating or plagiarism. BestColleges. https://www.
bestcolleges.com/research/college-students-ai-tools-survey/.
Wiggers, K. (2023, Feb. 16). Most sites claiming to catch AI-written text fail spectacularly. TechCrunch. https://techcrunch.com/2023/02/16/
most-sites-claiming-to-catch-ai-written-text-fail-spectacularly/.
Williams, R. (2023, July 7). AI-text detection tools are really easy to fool. MIT Technology Review. https://www.technologyreview.com/2023/
07/07/1075982/ai-text-detection-tools-are-really-easy-to-fool/.
Winston.ai. (2023, Feb. 14). Best AI detectors in 2023 compared. https://gowinston.ai/best-ai-detector/.
Writer. (2023). AI content detector. https://writer.com/ai-content-detector/.
Yan, D., Fauss, M., Hao, J., & Cui, W. (2023). Detection of AI-generated essays in writing assessments. Psychological Test and Assessment
Modeling, 65(1), 125–144. https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam_2023-1/PTAM__1-2023_5_
kor.pdf.
ZeroGPT. (2023). GPT-4, ChatGPT & AI detector by ZeroGPT: detect OpenAI text. https://www.zerogpt.com/.

Appendix 1

Topics of the ChatGPT Papers


Although most of the paper topics were suggested by personal experience with students and their written
work (Walters et al., 2020), about two dozen websites were consulted for additional ideas. Topics 24, 33, and 42
are similar to those suggested by Paperell.net (2023). Topics 19, 22, and 37 are similar to those suggested by
Sarikas (2020), Allison (2023), and Kearney (2022), respectively. Topics 1–8 are in the humanities, 9–33 in the
social sciences, and 34–42 in the natural sciences:
1. Why was Stonehenge built? What are the most likely explanations, and what evidence supports or
challenges each of them?
2. What were the causes of the Second Boer War (1899 to 1902)? What did the British Empire, the South
African Republic, and the Orange Free State each hope to achieve?
3. What major nineteenth-century literary works received initially negative reviews but are now regarded as
key contributions to literature? What accounts for the changing opinions of these works?
4. What studies best demonstrate how quantitative methods can be applied to the analysis of English-
language literary works?
5. When did unicorns first appear in literature? How has the depiction of unicorns and their characteristics
changed over time?
6. Will languages other than English gain importance over time as languages of scientific discourse?
7. What accounts for the dominance of American and British songwriters and musicians in twentieth- and
twenty-first-century popular music? Why did no other countries’ artists have a similar impact?
8. What are the historical origins of the religious concept of purgatory? Who put forth the concept of
purgatory? Was it accepted initially? When and how did it assume its place within Catholic theology?
9. Among retired Americans and those approaching retirement, are there distinct types of migration or geographic mobility (distinct groups of migrants)? What are the distinctive characteristics of each type or group?
10. What were the unintended effects of China’s one-child policy? How have the Chinese government and the
Chinese people responded to them?
11. How have ride-sharing services such as Uber and Lyft influenced overall employment in the taxi and ride-
sharing industry? How have they influenced wages?
12. In the present-day United States, what are the most effective strategies by which wealthy individuals can
minimize their income tax payments?
13. What are the long-term economic and political impacts of the global shortages of copper, lithium, nickel,
and cobalt?
14. What is the best way to determine the impact of Brexit on the UK economy?
15. Why did the US government first institute minimum wage laws? What were they hoping to achieve?
16. Among American college students, to what extent do self-reported assessments of ability represent self-
efficacy rather than ability?
17. Can synchronous demonstrations, delivered online, be just as effective as in-person lab instruction for
undergraduate biology courses?
18. Can the educational success of US charter schools at the high school (secondary) level be attributed to
factors other than the socioeconomic characteristics of their students?
19. Do students who get free meals in grades P–5 do better academically than students of similar backgrounds
who do not get free meals?
20. Is there evidence to support the idea that high school math teachers who struggled with math can be more
effective than those for whom math came easily?
21. To what extent are university students’ evaluations of their instructors related to the difficulty of the
course? What is the best way to overcome any bias related to the link between teaching evaluations and
course difficulty?
22. What are the advantages and disadvantages of taking a “gap year” of employment or volunteer work
between high school and college – for individuals and for society?
23. Are there systematic differences in the organizational leadership styles of men and women? To what
extent are they unique to either women or men?
24. Who were the most successful businesswomen of the twentieth century?
25. Internationally, how have Patrick S. Atiyah’s “Accidents, Compensation and the Law” and “The Damages
Lottery” influenced legal education, practice, and theory?
26. What are the military missions or situations for which aerial drones have proven most successful? In what
areas do they have the greatest unmet potential?
27. What occupations are most likely to disappear entirely over the next 20 years?
28. In the United States, what safety-related innovations (devices, policies, or procedures) were once mandated
by law or regulation but later abandoned? Why were they abandoned? On what grounds should
safety-related innovations be evaluated?
29. Do the fans at a football stadium influence the outcome of the game? Can we isolate the impact of the fans’
behavior from the impact of having home-field advantage (and more fans in the stadium)? [Both GPT-3.5
and GPT-4 interpreted this question in terms of association football (soccer) rather than American
football.]
30. Across nations, what is the influence of gun control legislation on rates of gun-related homicide, suicide,
and accidental death? What factors make these comparisons potentially difficult?
31. Are adolescents who play violent video games especially likely to commit acts of violence? Do violent video
games have other negative (or positive) psychological effects?
32. In terms of recruiting, training, and managing personnel, what are the most effective methods of preventing
police violence against the public (“police brutality”)?
33. What percentage of political assassination attempts are successful? What evidence can be used to address
this question?
34. At the individual level, what is the impact of professional dental care on morbidity and mortality risk?
35. How harmful are e-cigarettes to the health of those who use them, relative to conventional cigarettes?
36. To what extent do sleep disorders influence the productivity of the American labor force?
37. Can cloning or similar methods be used to bring back extinct plant species? Extinct animal species?
38. What strategies have proven most effective as methods of stabilizing and increasing the orangutan
population?
39. To what extent can global climate change be attributed to ruminant grazing and dairy farming?
40. What is the best way to gauge the environmental impact of a large-scale switch to electric vehicles for
private passenger transportation in the United States? Account for the impact of the vehicles themselves as
well as the need to generate electricity from sources such as natural gas, coal, nuclear, wind, and
hydropower.
41. Which island nations and coastal nations will be most affected by climate change? What steps are they
taking to prepare?
42. How are molten salt reactors different from conventional nuclear fission reactors? What are their unique
advantages and disadvantages? In what ways are they more or less safe than conventional fission reactors?

Appendix 2

Responses Provided by the AI Text Detectors


The numbers in the n columns indicate the number of documents in each response category across all three
document types – GPT-3.5, GPT-4, and human generated.
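
For readers who want to reproduce the tallies below from raw detector output, the following sketch shows one way the binning could be done. It is a minimal illustration, not the study’s actual procedure: the response-to-category mapping shown is an excerpt based on the OpenAI table below, and the function and variable names are invented for this example.

```python
from collections import Counter

# Excerpt of a response-to-category mapping, taken from the OpenAI detector's
# table below; each detector's full set of responses and their categories
# are listed in the tables that follow.
CATEGORY_BY_RESPONSE = {
    "Likely AI generated": "AI",
    "Possibly AI generated": "AI",
    "Unclear if it is AI generated": "uncertain",
    "Unlikely AI generated": "human",
    "Very unlikely AI generated": "human",
}

def tally_responses(responses):
    """Count the documents falling into each of the three response categories."""
    counts = Counter(CATEGORY_BY_RESPONSE[r] for r in responses)
    return {category: counts.get(category, 0)
            for category in ("AI", "uncertain", "human")}

# With the 126 test documents (42 GPT-3.5, 42 GPT-4, and 42 human generated),
# the OpenAI detector's totals would come out as reported in its table:
# {"AI": 58, "uncertain": 21, "human": 47}
```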

Content at Scaleᵃ

Response n

Responses counted as AI 57
Highly likely to be AI generated! (10–29% human) 8
Likely to be AI generated! (33–44% human) 17
Likely both AI and human! (60–79% human) 32
Responses counted as uncertain 19
Unclear if it is AI content! (45–58% human) 19
Responses counted as human 50
Highly likely to be human! (80–100% human) 50
ᵃ The descriptive text appears to be based primarily on the detector’s confidence in the assessment, while the numeric results appear to reflect the percentage of the text that is AI.

ContentDetector.aiᵃ

Response n

Responses counted as AI 38
Likely AI content (How artificial is your content: 67–82%) 38
Responses counted as uncertain 34
Unclear (How artificial is your content: 50–67%) 34
Responses counted as human 54
Likely human content (How artificial is your content: 16–50%) 54
ᵃ These results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

Copyleaksᵃ

Response n

Responses counted as AI 84
Suspected cheating: AI text detected. Very high. We are unable to verify that the text was written by a human 84
Responses counted as uncertain 0
(None) 0
Responses counted as human 42
(No AI-related alerts associated with the text) 42
ᵃ Copyleaks provides an overall descriptive assessment for the entire document along with statements such as “93.3% probability for human” or “94.8% probability for AI” for particular parts of the document. Those numeric values indicate the detector’s confidence in the assessment – not the percentage of the text that is AI. Moreover, the percentages reported by Copyleaks are not actual probabilities, since “30% probability for human” does not mean “70% probability for AI.” It simply means “This text is probably human generated, and our confidence in that assessment is 30 on a scale from 1 to 100.”
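
Because these scores are easy to misread, a short illustration may help. The sketch below assumes a simplified, hypothetical output format rather than the actual Copyleaks interface; it shows the correct reading of such a statement, in which the number qualifies the stated label and is never converted into a complementary probability for the other label.

```python
def interpret_copyleaks_style(label: str, confidence: float) -> str:
    """Read a (label, confidence) pair the way the footnote above describes.

    "30% probability for human" means: the text is probably human generated,
    and the detector's confidence in that call is 30 on a scale from 1 to 100.
    It does NOT mean there is a 70% probability that the text is AI generated.
    """
    return f"probably {label} generated (confidence {confidence:.0f}/100)"

print(interpret_copyleaks_style("human", 30))
# -> probably human generated (confidence 30/100)
```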

Crossplagᵃ

Response n

Responses counted as AI 71
This text is mainly written by an AI (No % score) 8
This text is mainly written by an AI (67–100% AI) 59
This text is co-written by both a human and an AI (50% AI) 4
Responses counted as uncertain 0
(None) 0
Responses counted as human 55
This text is mainly written by a human (0–6% AI) 55
ᵃ These results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

GPT Radarᵃ

Response n

Responses counted as AI 54
Likely AI generated (52–84% accuracy) 54
Responses counted as uncertain 0
(None) 0
Responses counted as human 72
Likely human generated (57–83% accuracy) 72
ᵃ These results appear to indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

GPTZero

Response n

Responses counted as AI 66
Your text is likely to be written entirely by AI 58
Your text is has [sic] a moderate likelihood of being written by AI 8
Responses counted as uncertain 19
Your text may include parts written by AI 19
Responses counted as human 41
Your text is most likely human written but there are some sentences with low perplexities 9
Your text is likely to be written entirely by a human 32

Grammicaᵃ

Response n

Responses counted as AI 68
100% AI 49
91–99% AI 10
81–88% AI 3
50–62% AI 5
39% AI 1
Responses counted as uncertain 4
25–29% AI 3
17% AI 1
Responses counted as human 54
1–8% AI 11
0% AI 43
ᵃ These results indicate the percentage of the text that is AI – not the detector’s confidence in the assessment.

IvyPandaᵃ

Response n

Responses counted as AI 60
High risk 52
Relatively high risk 8
Responses counted as uncertain 29
Medium risk 29
Responses counted as human 37
Relatively low risk 37
ᵃ These results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

OpenAI

Response n

Responses counted as AI 58
Likely AI generated 32
Possibly AI generated 26
Responses counted as uncertain 21
Unclear if it is AI generated 21
Responses counted as human 47
Unlikely AI generated 5
Very unlikely AI generated 42

Originality.aiᵃ

Response n

Responses counted as AI 84
100% AI 80
98–99% AI 3
70% AI 1
Responses counted as uncertain 2
33–34% AI 2
Responses counted as human 40
15–25% AI 4
0–7% AI 36
ᵃ These results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

Saplingᵃ

Response n

Responses counted as AI 53
97–100% AI 36
81–94% AI 7
73–79% AI 10
Responses counted as uncertain 35
61–68% AI 8
52–58% AI 7
40–49% AI 11
30–38% AI 8
“Unexpected error” 1
Responses counted as human 38
20–29% AI 8
10–19% AI 11
3–9% AI 7
0% AI 12
ᵃ These results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI. The “Unexpected error” message persisted even after repeated attempts to conduct the analysis.

Scribbrᵃ

Response n

Responses counted as AI 72
100% AI 51
93–99% AI 10
81–86% AI 2
55–73% AI 5
45% AI 1
31–36% AI 3
Responses counted as uncertain 1
25% AI 1
Responses counted as human 53
1–7% AI 10
0% AI 43
ᵃ These results indicate the percentage of the text that is AI – not the detector’s confidence in the assessment.

SEO.aiᵃ

Response n

Responses counted as AI 82
Your text appears AI generated (Probability for AI is 71–100%) 82
Responses counted as uncertain 30
Your text appears uncertain to determine (Probability for AI is 45–70%) 30
Responses counted as human 14
Your text appears human made (Probability for AI is 1–37%) 14
ᵃ These results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

TurnItInᵃ

Response n

Responses counted as AI 84
100% AI 83
84% AI 1
Responses counted as uncertain 0
(None) 0
Responses counted as human 42
0% AI 42
ᵃ These results indicate the percentage of the text that is AI – not the detector’s confidence in the assessment.

Writer

Response n

Responses counted as AI 60
You should edit your text until there’s less detectable AI content (0–90% human-generated content) 60
Responses counted as uncertain 0
(None) 0
Responses counted as human 66
Looking great! (92–94% human-generated content) 10
Fantastic! (96–100% human-generated content) 56

ZeroGPTᵃ

Response n

Responses counted as AI 78
Your file content is AI/GPT generated (63–100% AI) 70
Your file content is most likely AI/GPT generated (61–85% AI) 4
Your file content is likely generated by AI/GPT (55% AI) 1
Most of your file content is AI/GPT generated (32–61% AI) 3
Responses counted as uncertain 15
Your file content contains mixed signals, with some parts generated by AI/GPT (36–48% AI) 4
Your file content is likely human written, may include parts generated by AI/GPT (13–50% AI) 5
Your file content is most likely human written, may include parts generated by AI/GPT (24–33% AI) 6
Responses counted as human 33
Your file content is most likely human written (13–27% AI) 11
Your file content is human written (0–17% AI) 22
ᵃ ZeroGPT provides both numeric values (which indicate the percentage of the text that is AI) and text descriptions (which appear to reflect the numeric values as well as the detector’s confidence in the assessment). The text descriptions do not always correspond to specific numeric values.
