
Education and Information Technologies

https://doi.org/10.1007/s10639-023-12249-8

Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in English education

Unggi Lee1 · Haewon Jung2 · Younghoon Jeon3 · Younghoon Sohn1 · Wonhee Hwang4 · Jewoong Moon5 · Hyeoncheol Kim1

Received: 19 April 2023 / Accepted: 28 September 2023


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023

Abstract
Through design and development research (DDR), we aimed to create a validated
automatic question generation (AQG) system using large language models (LLMs)
like ChatGPT, enhanced by prompting engineering techniques. While AQG has
become increasingly integral to online learning for its efficiency in generating ques-
tions, issues such as inconsistent question quality and the absence of transparent and
validated evaluation methods persist. Our research focused on creating a prompt
engineering protocol tailored for AQG. This protocol underwent several iterations
of refinement and validation to improve its performance. By gathering validation
scores and qualitative feedback on the produced questions and the system’s frame-
work, we examined the effectiveness of the system. The study findings indicate that
our combined use of LLMs and prompt engineering in AQG produces questions
with statistically significant validity. Our research further illuminates academic and
design considerations for AQG in English education: (a) certain question
types might not be optimal for generation via ChatGPT, and (b) ChatGPT sheds light on
the potential for collaborative AI-teacher efforts in question generation, especially
within English education.

Keywords Automatic Question Generation · Prompt Engineering · ChatGPT · Large Language Model · English Education

1 Introduction

With the rise of online learning, there is a pressing need to effectively gauge learn-
ers’ understanding and progress. While online courses offer content in a rich and
convenient manner, assessments remain their backbone. They not only measure
understanding but also reinforce learning. However, crafting these assessments,

Extended author information available on the last page of the article


particularly questions, poses a challenge. Teachers grapple with vast student data
and often find the question development process to be laborious and time-consum-
ing. Automatic question generation (AQG) involves using computational methods to
automatically generate questions from the provided dataset. Amplified by artificial
intelligence (AI), AQG offers a streamlined approach to assessment development. It
not only alleviates the manual workload of educators but also introduces scalability
to assessments (Das et al., 2021). As the digital education landscape evolves, the
integration of tools like AQG transitions from being merely beneficial to essential,
ensuring a more efficient learning experience. Specifically, there has been a growing
interest in applying AQG systems in English education. AQG can be a great help
for teachers, as it allows them to create questions based on provided prompts. This
makes it easier to create formative assessments without additional resources, making
the process more efficient and resourceful (Cao et al., 2022; Larrañaga et al., 2022).
While previous AQG research in education has demonstrated potential, certain
technical challenges remain when integrating AQG into real-world settings and
implementations. In a recent review by Kurdi et al. (2020), existing AQG applica-
tions in educational research face two major concerns: (a) the inadequate quality
of question generation and (b) the low validity and reliability of evaluation meth-
ods used to determine the quality of questions generated through AQGs. Previous
research suggests that AQG tends to generate low-quality questions because they
largely rely on hand-crafted templates (Alsubait et al., 2016; Kurdi et al., 2020).
Although research has already demonstrated the integration of machine learning-
based models, such as the sequence-to-sequence approach (Gao et al., 2018) and
logistic regression approach (Van et al., 2021), these attempts generally exhibit low
validity and reliability in evaluation methods (Kurdi et al., 2020).
A prevailing skepticism surrounds the reliability of AQG architectures. Many AI-
driven AQG systems sometimes falter in generating contextually-appropriate ques-
tions, often due to biases and imperfections in their training data. Furthermore, they
may not possess the domain-specific expertise required to formulate pertinent ques-
tions for specialized topics. This underscores the urgency for innovative strategies
that amplify the efficacy and quality of AQG systems. In response, our study seeks
to establish more dependable AQG methods. We employ a prompt engineering tech-
nique for AQG and assess its efficacy. Recent strides in leveraging large language
models (LLMs) like ChatGPT have shown the possibility of high performance with-
out intensive model training (Brown et al., 2020). The growing interest in prompt-
based learning exemplifies this trend, demonstrating superior results in vari-
ous linguistic tasks while conserving resources (Liu et al., 2023). A synergistic use
of both an LLM and few-shot learning may pave the way for models that excel across
diverse contexts. However, there is a dearth of empirical studies evaluating the mer-
its of prompt-based learning for AQG, especially in the field of English education.
In this study, we hence conduct design and development research (DDR) that
encompasses multiple cycles of qualitative data collection, analyses, and progressive
tool validations. Our primary objective is to design, validate, and assess the quality
of an AQG system powered by a prompt-based learning technique, with a particu-
lar emphasis on its internal and external validity. We pose the following research
questions: (1) What is the optimal design and integration of a prompt


engineering method, serving as AQG, for creating English reading problems?, (2)
To what extent can a prompt engineering protocol with generated English problems
yield high validity and reliability?

2 Problem Statement

The research questions posed in our study have the potential to yield significant
impacts. Concerning the first research question, the design and application of a
prompt engineering method carry several implications. Primarily, it augments the
authenticity and ecological validity of reading assessments by leveraging a range
of genuine reading materials. Such an approach ensures alignment with real-world
reading situations, fostering a transfer of skills beyond the classroom. In addition, a
proficient prompt engineering method can address varied reading proficiencies. By
crafting problems of different complexities, the assessment can span a breadth of
proficiency levels, from beginners to advanced. Such inclusivity guarantees that the
feedback from assessment can be both insightful and actionable for learners, educa-
tors, and policymakers. Furthermore, the adapted prompt engineering method bolsters
assessment efficiency. Automated systems can produce reading problems, enhancing
the scope of reading skills assessed and streamlining test development.
Turning to the second research question, validity and reliability are two essen-
tial cornerstones of any assessment instrument. To ensure the credibility of reading
assessments, it is crucial to explore how effectively a prompt engineering protocol,
supported by generated English problems with high validity and reliability stand-
ards, can be used to create such assessments. Probing the validity of this protocol
lets us gauge whether the generated English problems truly tap into the core con-
structs of reading comprehension. Comprehensive validity ensures that assessments
capture learners’ grounded abilities, allowing precise interpretations of their reading
capabilities. Expert feedback further refines content validity, ensuring tight align-
ment with the measured constructs. On the reliability front, it is about consistent
measurement. A detailed examination of the prompt engineering protocol’s reliabil-
ity offers insights into the consistency of the generated problems’ results across test
administrations. This empirical exploration covers areas like internal consistency
and inter-rater reliability, further fortifying the dependability of the test outcomes.
Through this research, we aim to foster advancements in English language assess-
ment. Our ultimate objective is to design, refine, and validate a prompt engineering
method that consistently enhances the validity, reliability, and real-world relevance
of reading assessments.

3 Literature review

3.1 English reading comprehension question

Asking students questions about what they have read is a common practice in
learning activities aimed to bolster reading comprehension. In English education,


developing adequate and valid reading comprehension questions has been empha-
sized. Specifically, the types of tests used to assess students’ English reading com-
prehension are critical. Research has investigated the impact of different types of
questions on students’ understanding and memory of text. Reading comprehension,
unlike speaking and writing skills, is an elusive and implicit competency; therefore,
tracking and evaluating students’ reading comprehension necessitates a range of
indirect assessments. Duke and Pearson argue that when students receive specific
types of questions, their comprehension and memory adapt to those types (Duke &
Pearson, 2009). For instance, if a teacher often poses questions that ask for factual
details, students’ answers are likely to focus on such details in English comprehen-
sion. Consequently, well-designed reading comprehension questions are essential for
encouraging students to think critically, and variations in test types can enable edu-
cators and researchers to measure different subsets of skills (Shohamy, 1984).
Reading comprehension encompasses various levels, and different question
types have been designed to assess students’ reading abilities (Eason et al., 2012).
Although there is little scholarly consensus on these question types, two major Eng-
lish taxonomies have been primarily used. Brown (2007) classified reading perfor-
mance as perceptive, selective, interactive, and extensive, which has been employed
to organize various assessment tasks such as multiple-choice, written response,
cloze tests, etc. (Brown & Abeywickrama, 2004). In addition, Day and Park (2005)
developed a taxonomy that includes six types of reading comprehension (i.e., lit-
eral comprehension, reorganization, inference, prediction, evaluation, and personal
response) and five forms of questions (i.e., yes–no (Y-N) questions, alternatives,
true or false, wh-questions, and multiple-choice) (Table 1). This taxonomy enables
teachers to create or assess questions based on their structure and answer formats,
as well as to identify the specific types of reading comprehension being tested. For
example, a question that requires recalling specific factual details can be both a lit-
eral comprehension and a wh-question.
Crafting good reading comprehension questions is no easy task. Teachers should
have a thorough understanding of the text and choose appropriate question types and
forms to accurately assess the student’s comprehension. This process requires sig-
nificant time and effort. Creating multiple-choice questions can be even more chal-
lenging as teachers need to generate plausible answer options that require careful
consideration of detail. This process is likely to be time-consuming and demanding
for teachers. Also, manual English question creation could result in limited variety
and quality, potentially leading to biases in selecting English questions, as it relies
on individual teachers’ backgrounds and expertise levels. It echoes the need for
question-generation techniques in English education.

3.2 Automatic question generation in English education

One promising area where AI has been applied in English education is AQG. Cre-
ating reading comprehension questions can be a time-consuming task for teachers,
and AQG technology, utilizing language models, has the potential to generate and
evaluate questions swiftly and with dependable results. AQG technology can offer

Table 1 The Taxonomy of English Reading Comprehension Questions (Day & Park, 2005)

Forms: Yes/No; Alternative; True or False; Who/What/When/Where/How/Why; Multiple Choice

Types of comprehension:
• Literal: grasping the explicit meaning of the text
• Reorganization: using information from different parts of the text to gain additional meaning
• Inference: questions whose answers are not explicitly stated in the text
• Prediction: determining what might happen next or after a story ends
• Evaluation: assessing the overall significance or value of the text
• Personal response: reacting emotionally or personally, based on understanding of the text

teachers valuable insight into the key characteristics of effective question creation
and evaluation, representing an essential step towards the development of meta-cog-
nitive knowledge about reading comprehension instruction. AQG can be advanta-
geous for monitoring student responses and providing feedback for each answer, a
task that may be challenging for a teacher dealing with a large class. Hence, AQG is
expected to relieve teachers of the burden when creating questions for basic compre-
hension and enable them to focus on higher-order issues in instructional design and
classroom management.
Aligned with this interest, a line of AQG research has been conducted in Eng-
lish education. Soonklang and Muangon used an AQG-supported English exercise
system for secondary students and explored their user experiences with the system
(Soonklang & Muangon, 2017). Morón et al. (2021) developed an AQG for begin-
ners. Using NLP techniques, their AQG tool leveraged linguistic information such as
semantic roles, co-references, and named entities to generate questions and answers
from a text selected by a teacher.
Despite the increasing attention to AQG applications in education fields, there
is a lack of a strong theoretical foundation and its connection with AQG in English
education. Current AQG research in the English education field tends to focus on
technical issues, such as developing AQG system architecture, rather than consider-
ing question types and forms aligned with the English education curriculum. Moreo-
ver, existing systems often have limited question types and focus on specific
linguistic information such as grammar, rather than higher-order skills for reading com-
prehension of various texts. In order to effectively use AQG technology in English
education, it is essential to seek ways to systematically generate question types that
are aligned with various needs in English education and to conduct iterative design
and validations of AQGs. Legitimizing the entire process of AQGs and testing the
question generation with low-resource learning techniques (i.e., zero-to-few-shot
learning) can be crucial.

3.3 Prompt engineering for automatic question generation

In order to address the limitations of AQG, the use of neural-based question-generation techniques has become prevalent (Pan et al., 2019). These techniques utilize
deep neural networks to generate questions from a provided input, effectively auto-
mating the generation process. This approach is particularly useful in the context of
English language education because it helps learners develop their language skills in
a more interactive and engaging way. However, these models require extensive train-
ing data and a technical understanding of deep neural networks to generate high-
quality questions, presenting a significant challenge for individual educators when
utilizing AQGs effectively.
Prompt engineering has emerged as an approach that allows for effective neural-
based AQGs with limited data sets (Oppenlaender, 2022). This technique involves
providing hints, or prompts, to the language model to aid in understanding the task
at hand and generating the desired results. With the use of LLMs, individuals can
achieve the desired results even with small data sets and without extensive technical


knowledge, as long as appropriate and accurate prompts aligned with the domain
are provided (Zhang et al., 2021). For example, ChatGPT, the current topic of our
interest, utilizes a prompt engineering approach by enabling users to easily produce
high-quality text generation results through a combination of prompts, pertaining to
their desired domain knowledge (Floridi & Chiriatti, 2020).
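To make this prompting pattern concrete, the minimal sketch below (our own illustration, not part of the study) sends a passage and a question-generation instruction to a chat-based LLM through the OpenAI Python SDK; the model name, prompt wording, and placeholder passage are assumptions for demonstration only.

```python
# Minimal sketch of prompt-based question generation with a chat LLM.
# Assumes the OpenAI Python SDK (v1.x) with an API key in the environment;
# the model name and the prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

passage = "..."  # an English reading passage chosen by the teacher

prompt = (
    "Read the passage below and write one multiple-choice question that asks "
    "for the main idea of the text. Provide four options and mark the answer.\n\n"
    f"Passage:\n{passage}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)  # the generated draft question
```

Because the instruction itself carries the domain knowledge, no model training or fine-tuning is involved; only the prompt changes.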
Although it is widely acknowledged that prompt engineering with ChatGPT has
demonstrated superior performance compared to other state-of-the-art methods on
benchmark data sets (OpenAI, 2023), there remains a dearth of empirical and vali-
dation studies on AQG tasks employing ChatGPT (Ben Abacha et al., 2016). Given
the vast range of potential applications in the fields of English education, few studies
have underscored the significance of the applicability and validity of ChatGPT and
prompt engineering in AQGs.

4 Method

4.1 Participants and procedure

As outlined earlier, our study poses two key research questions: (1) What is the opti-
mal design and integration of a prompt engineering method, serving as AQG, for
creating English reading problems?, (2) To what extent can a prompt engineering
protocol with generated English problems yield high validity and reliability? To
answer these questions, we adopted the design and development research (DDR)
methodology, which is suitable for designing and validating technology-based edu-
cational intervention (Richey & Klein, 2005). Richey and Klein (2005) describe
DDR as a systematic exploration of design, development, and evaluation processes,
aimed at grounding the creation of both instructional and non-instructional prod-
ucts with empirical evidence. DDR can cover a spectrum of studies: examin-
ing specific instructional design efforts, investigating the contexts in which tools are
designed, developed, and evaluated, and analyzing the overall process or its distinct
components. In the present study, the utilization of DDR holds significance because
it offers a structured blueprint for scrutinizing the entire journey of design, develop-
ment, and evaluation, anchoring the creation of AQG-driven assessment products.
Generally, DDR comes in two primary forms. The first type focuses on describing
and evaluating a specific product or program. The second type is a thorough
illustration of design, development, or evaluation processes, in its entirety or as a
specific component. Given our goal of optimizing the design of AQG for English
reading problems, we found the first DDR type more appropriate. This approach
underscores engagement with stakeholders, experts, and users throughout the DDR
process, ensuring that the proposed intervention is theoretically and practically
sound. Our objective was twofold: to design and validate an AQG prompt engineer-
ing protocol and to assess the quality of generated questions aligned with English
education. As our intervention is in its early stage, we recognize the need for ongo-
ing refinement of the prompts, informed by continual testing and validation. Aligned


with the DDR framework, we went through multiple rounds of iterative validation
and refinement.

4.2 Tools and intervention

For this study, we initially utilized the GPT-3 model to create English problems.
However, with the subsequent release of ChatGPT by OpenAI, our research took a
new direction. After assessing ChatGPT’s capabilities, we decided to shift gears and
regenerate all questions using this newer model in preparation for our second valida-
tion phase. As a result, we updated our prompting manuals to reflect this change.
To leverage ChatGPT in crafting English questions, we selected two passages
featured in the Korean SAT exam. The structured and consistent format of English
reading questions in the Korean SAT makes them ideal templates for the automated
generation of questions (Kim & Kang, 2012). To ensure a comprehensive evaluation
that captures potential variations in question types, we selected one literary passage
and one non-literary passage for the exercise.

4.3 Instruments

In this study, we employed a combination of interviews and questionnaires to validate both our prompting manual and the questions we generated from experts’ and
English teachers’ reviews. The interviews consisted of predefined questions aimed at
assessing the quality of the manual and the generated questions. The questionnaire
sought feedback on the manual’s usability and validity, as well as the validity of
each individual question.

4.4 Procedure

4.4.1 AQG and protocol development

In the first phase, our objective was to develop a prototype for the protocol. We
started by reviewing current methods and techniques in prompt engineering through
an extensive literature review. Using evidence and case studies on prompt engineer-
ing and AQGs, we structured the system’s foundational architecture and its valida-
tion benchmarks.
Our prototype development, depicted in Fig. 2, was informed by thorough
research in the following three areas: (a) prompt engineering, (b) AQG, and (c) the
formulation of English reading questions. Given the swift advancements in AI engi-
neering and techniques, we focused on recent literature to ensure our framework lev-
eraged the latest innovations. We homed in on effective prompt generation strategies,
such as designing specialized templates for categorized problems (Shin et al., 2020)
and maximizing outcomes from limited data sets (Floridi & Chiriatti, 2020). In our
AQG literature review, we emphasized works intersecting technology with English


Fig. 1  Matrix about the type of questions that ChatGPT can generate (briefly presented)

education, such as Brown and Abeywickrama (2004) and Morón et al. (2021). This
helped us understand how to craft prompts and deploy AQG using LLMs.
We also sought literature centered on English education and question creation.
Our goal was to uncover clear methodologies and frameworks for question gen-
eration, with an emphasis on practicality. Works providing hands-on examples and
comprehensive methods for English reading question formulation (Brown & Abey-
wickrama, 2004; Day & Park, 2005) proved pivotal. We also evaluated Korean SAT
English reading problems to shape our AQG framework.
Drawing from our literature synthesis findings, we drafted a prompt engineering
manual prototype. This prototype, a blend of guidelines and a 2D matrix framework
(refer to Table 1) based on Day and Park (2005), incorporated examples of prompt
input sentences and resultant questions. Our framework, built using classifications
from the established studies (see Fig. 1), is summarized here due to space restric-
tions; a detailed version can be accessed via the link in the footnote.1 Each
matrix cell contained a crafted prompt, designed using insights from our literature
review. We tested these prompts with ChatGPT, ensuring the generated questions
were appropriate and content-aligned. Only effective prompts were retained in our
framework.

1 This link is connected to the AQG manual in this research. https://docs.google.com/document/d/1h23DtAVeKHd1AiUvTlpVN3AG073xRg4vygpu4-o82-s/edit?usp=sharing


Table 2 The demographic of the expert group (profession, years of experience, description of expertise)

A: Engineer, 5 years. B.S. degree in Computer science; Data mining & NLP engineer
B: Research director, 13 years. Ph.D. in Machine learning; Senior research director of AI research team
C: English teacher, 19 years. Ph.D. student in English education; Senior teacher in a high school
D: English teacher, 14 years. Ph.D. student in English education; Senior teacher in a middle school
E: English teacher, 15 years. B.A. degree in both Educational technology & English education, M.A. in Linguistics; High school teacher

The protocol’s guidelines primarily instruct users on leveraging ChatGPT and
the framework, founded on research showing that, even with limited data, satisfactory results
can be achieved through prompt engineering with a sizable language model. Users are
directed to input prompts into ChatGPT, referencing our framework and example
prompts, simplifying the process of generating English questions without the need
for delicate prompts or extensive training. After finalizing the protocol, we rigor-
ously tested its entire workflow to confirm its efficacy.
For the actual implementation, we input the English passage and the associated
prompt settings into the application, which then generated the questions. We cata-
loged the generated questions based on their types and the nature of the English
passage. The process was iterative: we continuously refined the prompts to optimize
the results. By comparing the outcomes from revised prompts with previous ones,
we assessed the effectiveness of the modification. Any changes that enhanced the
results were then solidified in the prompt.
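As a rough illustration of the cataloging step described above (not the authors' actual tooling), generated questions could be recorded with their question type, source passage, and prompt version so that outputs from revised prompts can be compared with earlier ones; all field names below are hypothetical.

```python
# Hypothetical sketch: catalog each generated question by question type,
# source passage, and prompt version so revised prompts can be compared
# with earlier ones. Field names are illustrative, not from the study.
import csv
from datetime import date

catalog = []  # one record per generated question

def log_question(passage_id, question_type, prompt_version, question_text):
    catalog.append({
        "date": date.today().isoformat(),
        "passage_id": passage_id,          # e.g., "literary-1" or "non-literary-1"
        "question_type": question_type,    # e.g., "main idea / wh-question, MC"
        "prompt_version": prompt_version,  # e.g., "v1" before revision, "v2" after
        "question": question_text,
    })

def export_catalog(path="aqg_catalog.csv"):
    # Write the catalog to CSV for side-by-side comparison of prompt versions.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(catalog[0].keys()))
        writer.writeheader()
        writer.writerows(catalog)
```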

4.4.2 Expert validation

In phase two, we invited experts to assess both the protocol and generated English
questions. We assembled a total of five experts, comprising two natural language
processing (NLP) engineers and three English education experts. All three English
education experts have over 10 years of field experience and hold master’s degrees
in their respective domains. For the NLP experts, we prioritized those with hands-
on experience in the domain, ensuring they could provide informed feedback on the
prompt engineering technique. Table 2 provides a detailed breakdown of the experts’
experiences and backgrounds.
For validation, we began by sending each expert a packet containing an overview
of the study, its objectives, and a questionnaire focused on the protocol. This was
followed by 60-min semi-structured interviews with each expert. These interviews
aimed at gathering feedback on the clarity, utility, and relevance of the protocol and
the questions it generated. In the next phase, experts were given a week to com-
plete a questionnaire that assessed the following: (1) the validity of the implementation


Fig. 2  Research process diagram

process used to derive the protocol, (2) the overall validity of the protocol, and (3)
the protocol’s validity in terms of specific prompts.
English education experts also reviewed the validity of detailed prompts generated
by the framework. Our questionnaire combined 4-point Likert scale items for numeri-
cal analysis and open-ended questions for qualitative feedback. We chose a 4-point
Likert scale to avoid neutral responses, which can add ambiguity. A 4-point Likert
scale without a neutral response option can make participants reflect more deeply on
their answers, leading to more genuine feedback. Such scales can help elicit clearer,
more thoughtful responses from participants (Krosnick et al., 2002), thus provid-
ing richer data. After gathering the experts’ feedback, we analyzed their insights and
determined the validation scores. The initial validation results informed revisions to
the protocol, framework, and questions, with subsequent questions updated as needed.
The entire research process is visually illustrated as shown in Fig. 2.

4.4.3 Teacher validation

To evaluate the validity of our protocol and the generated questions, we sought feed-
back from in-service English teachers, who are actively working in the field. This
validation ensured the practicality and accuracy of our tools in real-world educa-
tional scenarios.
For teacher validation, the study participants assessed the usability of the manual,
framework, and the prompts, and the relevance of the questions to generate English


Table 3 The demographic of the teacher group (gender, years of experience, type of school)

A: Female, 10 years, High school
B: Female, 10 years, High school
C: Male, 8 years, High school
D: Female, 21 years, High school
E: Female, 9 years, Middle school
F: Female, 10 years, High school
G: Male, 7 years, High school
H: Female, 6 years, Middle school
I: Female, 8 years, Middle school
education. We adopted a 4-point Likert scale for gaining quantifiable feedback.


Teachers also rated each question type and the ones generated by ChatGPT using
the same scale, emphasizing usability and utility in a general educational setting not
limited to specific exams like the SAT. Before the evaluations, we clearly commu-
nicated the grading criteria, which gauged the protocol’s ability to produce valuable
questions for educators. This structure aimed at the applicability of the AQG system
in real-world contexts.
We engaged nine English teachers, representing a mix of teaching experiences
and school types, middle or high school specifically (details in Table 3). Elementary
school teachers were excluded due to their broader curriculum. By including teach-
ers with varied backgrounds, we aimed to capture diverse reactions to the technol-
ogy, ensuring a holistic evaluation. The profiles of the invited English teachers are
outlined in Table 3.
We furnished the teachers with a comprehensive manual on using ChatGPT for
question generation. If needed, brief meetings with the researchers clarified any
questions. After acquainting themselves with the system, teachers independently
generated English reading comprehension questions using ChatGPT and our frame-
work. The teachers then evaluated these questions based on our revised protocol and
rated each using the provided tools. Their feedback also covered the usability of the
manual and framework and their overall efficacy. We also incorporated open-ended
questions to gain deeper insights and suggestions for refinement.

4.4.4 Validation analysis

After collecting the data from the teacher questionnaires and interviews, we exam-
ined the results to discern areas for improvement. To ensure the validity and reliabil-
ity of our protocol and the generated questions, we used both content validity index
(CVI) and inter-rater agreement (IRA) methods as recommended by Rubio et al.
(2003). The CVI assesses the validity of the test items; we calculated it by divid-
ing the number of positive ratings by the number of participants. We set a threshold
of 0.8, which is indicated as a valid score (Rubio et al., 2003). In contrast, the IRA
evaluates the reliability of the expert ratings, reflecting their consensus about item


quality and performance. We computed the IRA by dividing the number of agreed-
upon items by the total number of items (an IRA score higher than 0.8 signifies
reliable items) (Lynn, 1986). To determine the validity of the protocol and frame-
work, we calculated the CVI for each question by dividing the number of experts
who rated the item positively (scores of 3 or 4) by the total number of experts. Like-
wise, we estimated the inter-rater agreement (IRA) by counting the items for which
the IRA was not less than 0.8 (Lynn, 1986).
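As a worked illustration of these two formulas (our own sketch, not the authors' analysis code), assuming 4-point ratings where a score of 3 or 4 counts as a positive rating:

```python
# Sketch of the CVI and IRA calculations described above (illustrative only).
# Each item is a list of 4-point Likert scores, one score per rater.

def cvi(item_ratings):
    """Content validity index: share of raters scoring the item 3 or 4."""
    return sum(1 for r in item_ratings if r >= 3) / len(item_ratings)

def ira(items, threshold=0.8):
    """Inter-rater agreement: share of items whose CVI meets the threshold."""
    return sum(1 for item in items if cvi(item) >= threshold) / len(items)

items = [
    [4, 3, 4, 3, 4],  # an item rated by five experts
    [3, 2, 4, 3, 3],
    [2, 2, 3, 4, 2],
]
print([round(cvi(i), 2) for i in items])  # per-item CVI: [1.0, 0.8, 0.4]
print(round(ira(items), 2))               # overall IRA: 0.67
```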

5 Result

5.1 Prototyping the protocol and framework

Following the literature review, we designed a 2D matrix that aligns the types and
forms of English reading questions, along with a protocol for AQG. The purpose of the
matrix is to assist teachers in efficiently generating English reading questions. The
matrix enables teachers to generate questions covering various question types, even
without background knowledge of or experience with generative AI. One dimension
of the matrix consists of the task type of the question: two major categories (i.e.,
literal and inferential) with 11 subcategories. Another dimension includes the for-
mat of the question: three major categories (yes/no, wh-question, and cloze) with
two options (multiple choice and open-ended). We used Korean CSAT English read-
ing questions and taxonomies as references (Day & Park, 2005). At this stage, we
employed GPT-3, a natural language processing model capable of producing texts
that resemble human writing in response to prompts and commands. This deep
learning technique is pre-trained with a large amount of text corpus data, enabling it
to learn human language patterns and structures. Because GPT-3 is more adaptable
than other models in producing complex human-like text outputs, such as language
creation and text completion, we believed it might be a suitable fit.
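For illustration, the 2D matrix can be thought of as a lookup table keyed by task type and question format, with each cell holding a reusable prompt template; the sketch below is our own simplification, and the cell wording is hypothetical rather than taken from the actual manual.

```python
# Hypothetical sketch of the 2D prompt matrix: each cell pairs a task type
# with a question format and stores a prompt template. Cell wording is
# illustrative; the actual manual contains the validated prompts.
PROMPT_MATRIX = {
    ("main idea", "wh-question, multiple-choice"):
        "Write a multiple-choice question asking for the main idea of the "
        "text below. Provide four options and indicate the correct answer.",
    ("inferential", "yes/no, open-ended"):
        "Write a yes/no question whose answer must be inferred from the text "
        "below, and state the expected answer.",
    # ... remaining cells omitted; combinations that failed testing are excluded.
}

def build_prompt(task_type, question_format, passage):
    """Combine the selected matrix cell with the chosen passage."""
    template = PROMPT_MATRIX[(task_type, question_format)]
    return f"{template}\n\nText:\n{passage}"
```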
In the GPT-3-supported prototype testing, we encountered several technical chal-
lenges. First, the prototype exhibited a high failure rate in generating questions for
certain subcategories. In addition, it produced identical results for subcategories that
necessitated similar approaches, even though they belonged to distinct subcategories
within the English education domain. For example, generating a question about the
main ideas required a comparable approach but was classified under different sub-
categories in the English education domain. Despite this distinction, the prototype
yielded identical AQG outputs for both subcategories. As a result, among the 65
possible combinations (13 × 5), the matrix excluded 35 prompt cases that failed to
produce appropriate AQG using GPT-3.

5.2 Expert validation: first‑round

For the initial protocol, we conducted a first-round validation with experts to verify
the entire procedure of the protocol and assess the major components and catego-
ries of the proposed framework (Table 4). A total of five subject matter experts


Table 4 The expert validation results (Mean, SD, CVI, and IRA for each item; the group-level IRA is marked with * when it passes the 0.8 threshold, meaning more than 80% of the raters agreed on the items and they are therefore reliable)

The validity of the process of deriving the protocol (group IRA 1.00*)
• Comprehensive literature search: Mean 3.40, SD 0.89, CVI 0.80, IRA 0.80
• Appropriate use of terms: Mean 3.40, SD 0.89, CVI 0.80, IRA 0.80
• Appropriate interpretation of literature review: Mean 3.20, SD 0.84, CVI 0.80, IRA 0.80
• Logical organization of the development: Mean 3.40, SD 0.55, CVI 1.00, IRA 1.00
• Appropriate reflection on the results from the literature review: Mean 3.20, SD 0.84, CVI 0.80, IRA 0.80

The validity of the protocol (group IRA 1.00*)
• Validity: Mean 3.80, SD 0.45, CVI 1.00, IRA 1.00
• Explainability: Mean 3.80, SD 0.45, CVI 1.00, IRA 1.00
• Usefulness: Mean 3.60, SD 0.55, CVI 1.00, IRA 1.00
• Comprehensiveness: Mean 3.40, SD 0.89, CVI 0.80, IRA 0.80

The validity of the framework (group IRA 1.00*)
• Validity: Mean 2.80, SD 0.45, CVI 0.80, IRA 0.80
• Explainability: Mean 3.20, SD 0.84, CVI 0.80, IRA 0.80
• Usefulness: Mean 2.80, SD 0.45, CVI 0.80, IRA 0.80
• Comprehensiveness: Mean 3.00, SD 0.71, CVI 0.80, IRA 0.80

(three from the field of English education and two from NLP fields) participated.
We employed mean, CVI, and IRA to quantitatively describe the results. Tables 4
and 5 present the expert validation outcomes. During the expert validation, the CVI
for each question ranged from 0.8 to 1.0, with an average of 0.84, which clearly
exceeded the threshold (0.8). The IRA was 1.0 for every section, indicating a strong
consensus among the experts regarding the validity of the prototype. These findings
demonstrate that the protocol and framework prototypes exhibit acceptable content
validity.
While the experts deemed the protocol and framework valid, there were some
areas of disagreement. One NLP expert assigned low scores to the derivation proce-
dures, universality, and comprehensiveness of the protocol. Derivation procedures
pertain to the method employed to create a protocol. Comprehensiveness indicates
the extent to which the protocol addresses all relevant issues, while universality con-
cerns its applicability across diverse situations. The expert specifically pointed out
that the low derivation procedure scores might result from insufficient guidance provided
by the protocol. They recommended incorporating additional figures and concrete
examples to improve the protocol’s comprehensiveness and universality. In addition,
the expert suggested that if the categorization of closed versus open-ended questions
was to be retained, it should also be applied to the cloze types.
English education experts assigned low scores to some of the prompts developed
with the framework. Although some items were dismissed due to low CVI scores,
the IRA scores provided an alternative interpretation of the findings. IRA scores
did not meet the required threshold for all categories except for topic sentences


Table 5 The result of expert validation of the prompts (Mean, SD, CVI, and IRA for each item; the group-level IRA is marked with * when it passes the 0.8 threshold, meaning more than 80 percent of the raters agreed on the items and they are therefore reliable)

Corresponding response (group IRA 0.60)
• Y-N question, multiple-choice: Mean 3.33, SD 0.58, CVI 1.00, IRA 1.00
• Y-N question, open-ended: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, multiple-choice: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, open-ended: Mean 3.00, SD 1.00, CVI 0.67, IRA 0.67
• Cloze: Mean 2.67, SD 1.15, CVI 0.33, IRA 0.67

Topic sentence (group IRA 1.00*)
• Y-N question, multiple-choice: Mean 0.33, SD 0.58, CVI 1.00, IRA 1.00
• Y-N question, open-ended: Mean 3.33, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, multiple-choice: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, open-ended: Mean 4.00, SD 0.00, CVI 1.00, IRA 1.00
• Cloze: Mean 1.33, SD 0.58, CVI 0.00, IRA 1.00

Purpose (group IRA 0.75)
• Y-N question, multiple-choice: Mean 1.33, SD 0.58, CVI 0.00, IRA 1.00
• Y-N question, open-ended: Mean 3.33, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, multiple-choice: Mean 4.00, SD 0.00, CVI 1.00, IRA 1.00
• Wh-question, open-ended: Mean 3.33, SD 1.15, CVI 0.67, IRA 0.67

Inferential question (group IRA 0.20)
• Y-N question, multiple-choice: Mean 3.33, SD 1.15, CVI 0.67, IRA 0.67
• Y-N question, open-ended: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, multiple-choice: Mean 2.00, SD 1.73, CVI 0.33, IRA 0.67
• Wh-question, open-ended: Mean 2.67, SD 1.53, CVI 0.67, IRA 0.67
• Cloze: Mean 2.67, SD 1.53, CVI 0.67, IRA 0.67

Tone and mood (group IRA 0.60)
• Y-N question, multiple-choice: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Y-N question, open-ended: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, multiple-choice: Mean 4.00, SD 0.00, CVI 1.00, IRA 1.00
• Wh-question, open-ended: Mean 3.00, SD 1.00, CVI 0.67, IRA 0.67
• Cloze: Mean 2.67, SD 1.53, CVI 0.67, IRA 0.67

Main idea (group IRA 1.00*)
• Y-N question, multiple-choice: Mean 3.33, SD 0.58, CVI 1.00, IRA 1.00
• Y-N question, open-ended: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, multiple-choice: Mean 3.67, SD 0.58, CVI 1.00, IRA 1.00
• Wh-question, open-ended: Mean 4.00, SD 0.00, CVI 1.00, IRA 1.00
• Cloze: Mean 1.00, SD 0.00, CVI 0.00, IRA 1.00

and main ideas. This outcome suggests that the validation outcomes of the prompts
may vary depending on the specific background of the English education experts.
While two experts, who were doctoral students in English education, provided
detailed feedback on linguistics, another expert, who majored in educational tech-
nology, expressed greater concern regarding the prompts’ adaptation in real-world
settings. Although it remains debatable whether all low CVI items necessitate fur-
ther revisions, items with both low CVI and IRA ratings certainly require modifica-
tions. To pinpoint the main cause of discrepancies, we examined the experts’ open-
ended survey responses. This data was utilized to identify gaps in the prompts or the


assessment procedure, and we made improvements to enhance the prompts’ validity
and reliability.
All cloze questions received low CVI scores, and experts unanimously agreed
that cloze questions asking for topic and main idea sentences exhibited low validity.
Low scores appeared to be caused by issues with the prompts and question formats
produced by GPT-3, such as questions with more than two potential answers and
missing instructions, which were identified by the experts. In particular, the experts’
assessments coherently indicated that the questions generated invalid results because
they failed to measure what they were supposed to measure. Some questions, how-
ever, such as tone and mood questions in the form of open-ended wh-questions, con-
centrated on valid content, and their validity was debated. For example, one expert
expressed concern that the question’s validity was low due to the ambiguous answer
choices, whereas the other two experts acknowledged its limitations but still consid-
ered the question to be accurate in determining the mood of the given text. The IRA
scores were generally below the threshold due to variations in expert perspectives.
We did not discard items with low IRAs because the experts provided clear feedback
and both views had pertinent points for the revision of the prototype. Instead, we
contemplated revising these items based on the feedback provided by the experts.
All results can be seen in Table 5.
After reviewing the written responses from the experts, we organized and tabu-
lated them to identify the primary reasons for discrepancies. Upon examining expert
feedback, we classified the issues into the following categories: (1) the categoriza-
tion of cloze questions and the prompts; (2) questions with more than two potential
solutions; (3) a vague distinction between correspondence questions and inferential
questions; (4) unclear prompts; and (5) reorganization of the framework.

5.3 Revision

Based on the first-round validation results, we revamped the existing design of the
prompts and protocol. Experts pinpointed the areas of refinement, especially under-
scoring the necessity for prompt fine-tuning to enhance AQG performance across
subcategories. Responding to this, we transitioned from using the GPT-3 model to
ChatGPT. This switch was motivated by ChatGPT’s design, which is optimized for
conversational interactions, ensuring a higher fidelity to the directions encapsulated
within the input prompts.
Our revisions to the prototype were informed by expert insights, leading us to
revamp both the framework and prompts. Addressing the five key areas identified for
improvement, we meticulously refined prompts by adjusting terminology and incor-
porating requisite phrasings. This nuanced approach ensured the prompts encapsu-
lated clearer intent and context, generating questions that were both accurate and
informative.
One area that stood out for revision was the cloze tests. Their original format
exhibited subpar validity and structural inconsistencies. Experts noted these tests
often overlooked the crux of the provided passages. Understanding the significance


of prompt engineering, we pivoted our strategy for cloze tests. We categorized them
into two distinct formats: multiple-choice and open-ended questions. By infus-
ing fresh textual elements into the prompts, we witnessed a marked improvement
in question generation accuracy. These iterative revisions improved the cloze tests’
quality and utility. The following section delves deeper into the prompt modifica-
tions, explaining the expert-driven refinements.

1. The change in the cloze questions’ subcategorization and prompts

• Rephrasing the prompts: Since the experts pointed out that the format of the cloze
questions was poor, the prompts were rephrased with more specific instructions. For
example, ’Make a cloze test with multiple-choice at the sentence that contains
the topic of the text. The cloze test should contain the whole text and the cloze.
Put the options at the end of the text.’
• Making new subcategories: The experts pointed out that the cloze question type
can contain both a multiple-choice version and an open-ended version. There-
fore, cloze questions with a multiple-choice type and cloze questions with an open-
ended type were created.
2. The change in the boundary between the correspondence questions and the inferential
questions
• Correspondence question, cloze test, open-ended: The expert also pointed out that
the boundary between the correspondence questions and the inferential questions
was uncertain, so the prompts needed to be reorganized. Thus we deliber-
ately used the word ’explicitly’ to distinguish these two types of questions. For
example, ’Make a cloze test with information that is explicitly shown in the text.
Rewrite the whole text with the blank cloze. Write a sentence to order me to fill
in the cloze with the correct answer.’
• The inferential question, cloze test, open-ended: With the inferential question,
we used the word ’implicitly’ in the revised prompt. For example, ’Make a cloze
test on a sentence that contains information that can be implicitly inferred from
the text. Rewrite the whole text with the cloze.’
3. Deleted cloze tests in the literal questions category
• According to the expert validation, having cloze tests in the literal (identify,
check) questions category is inappropriate because cloze items inherently
require inferential thinking. Thus cloze tests in the literal questions cat-
egory were deleted.
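Put together, these revisions can be pictured as updated matrix cells; the sketch below simply restates the example prompts quoted above in that form, and the cell keys are our own labels rather than the manual's.

```python
# Sketch of the revised matrix cells after round-one expert feedback: cloze
# items are split into multiple-choice and open-ended variants, and the words
# "explicitly"/"implicitly" separate correspondence from inferential prompts.
# Prompt wording is quoted from the revision examples above; keys are our labels.
REVISED_CLOZE_PROMPTS = {
    ("topic sentence", "cloze, multiple-choice"):
        "Make a cloze test with multiple-choice at the sentence that contains "
        "the topic of the text. The cloze test should contain the whole text "
        "and the cloze. Put the options at the end of the text.",
    ("correspondence", "cloze, open-ended"):
        "Make a cloze test with information that is explicitly shown in the "
        "text. Rewrite the whole text with the blank cloze. Write a sentence "
        "to order me to fill in the cloze with the correct answer.",
    ("inferential", "cloze, open-ended"):
        "Make a cloze test on a sentence that contains information that can "
        "be implicitly inferred from the text. Rewrite the whole text with "
        "the cloze.",
}
```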

5.4 Teacher validation: second‑round

In the second validation phase, we sought feedback from English teachers on the
updated protocol, matrix, and associated prompts. Conducting a survey similar to
the first round, our goal was to validate the effectiveness and quality of the modi-
fications. We engaged eight English teachers from diverse teaching backgrounds
for this purpose. They were trained to first familiarize themselves with the proto-
col by completing a questionnaire. The questionnaire aimed at gauging their per-
ception of the protocol’s comprehensiveness, clarity, and utility. Comprising 44


Table 6 The teacher validation result for overall usability (Mean, Variation, CVI, and IRA for each item; overall IRA 0.67)
• Comprehensiveness: Mean 2.88, Variation 0.41, CVI 0.75, IRA 0.75
• Clarity: Mean 3.00, Variation 0.29, CVI 0.88, IRA 0.875
• Usefulness: Mean 3.36, Variation 0.27, CVI 1.00, IRA 1.00

items, it employed a 4-point Likert scale for quantitative data collection, supple-
mented with open-ended items for richer qualitative insights. We then collated and scruti-
nized the collected data, focusing on metrics like means, variation, CVI, and IRA
to derive validation outcomes. The results of the teacher validation, particularly
in regard to overall usability metrics (i.e., comprehensiveness, clarity, and useful-
ness), are presented in Table 6.
The areas of clarity and usefulness had an overall CVI score that exceeded the
acceptable threshold of 0.8. However, the score for comprehensiveness was below
the threshold, with a score of 0.75. Although the overall IRA score of 0.67 may
suggest low reliability, each item had an IRA score above 0.8, except for compre-
hensiveness, indicating high reliability for the other two items. In their qualitative
feedback, teachers noted that the protocol and matrix provided a clear and easy-
to-follow guide. However, they expressed concerns about the accessibility of the
protocol and matrix for those who are not familiar with AQG technology. Teach-
ers found the examples provided with the protocol to be useful in enhancing their
understanding. To address the comprehensiveness issue, we have decided to offer
the guide with multi-modal stimuli, such as pictures, diagrams, or video clips, to
provide more detailed explanations about the technology.
The overall mean score across all prompts was 3.35 (with SD of 0.25), with
an average CVI score of 0.89. These results indicate that the majority of gener-
ated questions accurately measured what they were intended to assess. The over-
all IRA score was 0.76, which is close to 0.8 and suggests that the results are fair
and reliable.
The majority of items with CVI scores lower than 0.8 belonged to the ’yes–no
questions with multiple choices’ types (including A2, A3, and A4 in Fig. 1, which
are multiple-choice questions targeting the main idea, purpose, and inferential types,
respectively, as displayed in Table 5) and ’cloze’ types (including F2, F4, and F6 in Fig. 1,
which are cloze questions targeting the main idea, inferential, and title types). Based on the teachers’
comments, ’yes–no’ questions generated by ChatGPT were deemed less appro-
priate for regular English classes, which helped explain the low scores. The
teachers elaborated that ’yes–no’ questions are often used for basic comprehen-
sion checks because they usually have a simple sentence structure. However, the
complex sentence structures frequently found in questions generated by Chat-
GPT could be distracting. Consequently, test takers may spend more time inter-
preting the complex sentence structure rather than focusing on the main ideas
of the question, leading to lower validity. The teachers also gave low ratings to
the ’cloze’ questions, citing that cloze questions generated by ChatGPT did not


meet conventional design standards of cloze questions. They anticipated the gen-
eration of cloze questions where blanks replace intended words or phrases, with
contextual hints embedded within the surrounding text. However, when simply
instructed to generate ’cloze’ questions through a prompt, ChatGPT did not
appear to have information about the requirements of cloze questions expected by
teachers.
Aside from the ’yes–no’ and ’cloze’ question types, several other items received
CVI scores below 0.8. These involved open-ended questions that asked for ’yes–no’
answers about the purpose of the text (B3), open-ended questions that required
inferential content from the text (D4), and sentence insertion questions with multi-
ple choices (C9). According to the teachers’ feedback, B3 and D4 seemed to focus
on irrelevant parts of the text, indicating a low level of validity. Due to its unclear
structure and lack of numbering, C9 was given a low rating because it could confuse
test-takers when attempting to answer the question.

5.5 Finalized protocol and generated questions

The AQG protocol underwent refinements following feedback from experts and
teachers. One notable adjustment was the transition from GPT-3 to ChatGPT. This
decision stemmed from two insights. First, we encountered technical challenges with
GPT-3, particularly its high failure rates in question generation. It became evident
that a system optimized for conversational interactions was needed to help teach-
ers craft natural and precise questions. Second, the diversity of question input
styles demanded a versatile approach. ChatGPT, designed to manage varied input
styles and adapt to diverse prompts, emerged as a more flexible and applicable solu-
tion. Our updated protocol encompasses the research objectives, an AQG guide,
the matrix, examples of input prompts, and the outcomes of AQG. It lays out a com-
prehensive strategy for AQG, leveraging the combined strengths of the matrix and Chat-
GPT as follows:

1. Select a passage to generate English reading questions.


• The length of the passage should be limited to under 250 words to control the
quality of output, given ChatGPT’s limited capability to comprehend long textual
input.
• It is highly recommended that educators carefully choose passages from a diverse
range of levels and topics that align effectively with the specified learning objec-
tives.
2. Select the task type and format of the question. (Refer to the matrix from the
Appendix for detailed categories.)
3. Take prompt input of selected type and format from the matrix.
4. Input the selected passage and the prompt from the matrix to ChatGPT.
5. Generate response from ChatGPT.
• Details of the input prompt could be modified for fine-tuning of the outcome.


6. Before using the generated question, review the output and revise it for possible
errors. Errors requiring close attention include unclear use of language, typos, and
misalignment between the format of the generated question and the format antici-
pated from the instruction.
7. Utilize the final outcome in educational circumstances.
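A minimal end-to-end sketch of this workflow is given below (our own illustration, assuming the OpenAI Python SDK; the model name and matrix prompt are placeholders, and the returned draft still requires the teacher's review in steps 6 and 7).

```python
# Illustrative run of the protocol: check the passage length, combine a matrix
# prompt with the passage, generate a draft question with ChatGPT, and hand
# the draft back for the teacher's review. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical matrix cell (task type: main idea; format: wh-question, multiple-choice).
MATRIX_PROMPT = ("Write a multiple-choice question asking for the main idea of "
                 "the text below. Provide four options and indicate the answer.")

def generate_question(passage, matrix_prompt=MATRIX_PROMPT):
    # Step 1: keep passages under roughly 250 words, per the protocol.
    if len(passage.split()) > 250:
        raise ValueError("Choose a passage under 250 words.")
    # Steps 2-4: pair the selected matrix prompt with the chosen passage.
    full_prompt = f"{matrix_prompt}\n\nText:\n{passage}"
    # Step 5: generate a draft question with ChatGPT.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": full_prompt}],
    )
    # Step 6: the draft must still be reviewed and revised by the teacher.
    return response.choices[0].message.content

draft = generate_question("...")  # replace "..." with the selected passage
print(draft)                      # step 7: use only after review
```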

6 Discussion, Implications, and Conclusion

6.1 General discussion

6.1.1 RQ1. What is the optimal design and integration of a prompt engineering method, serving as AQG, for creating English reading problems?

AQG has increasingly shown promise in enhancing educational experiences for
both students and teachers. Previous research has identified two major technical
challenges in integrating AQG into real-world settings: (a) inadequate quality of
question generation and (b) low validity and reliability of evaluating the gen-
erated questions by AQG technology. The reliance on hand-crafted templates in
earlier AQG methods has been a significant issue, leading to inconsistent results.
Although there have been several attempts to integrate machine learning-based
natural language generation models, such as the sequence-to-sequence and logis-
tic regression approaches, inconsistent evaluation methods have hindered their
effectiveness (Kurdi et al., 2020).
In alignment with the RQ, we explored the optimal design and integration
strategy for a prompt engineering method for AQG in creating English reading
problems. To address these challenges, the present study proposed using Chat-
GPT’s new prompt-based AQG approach, which generates questions based on
specific prompts or topics instead of relying on existing templates. This study
focused on aligning prompting questions and reading comprehension taxonomy
using a 2D matrix-structured framework. We carefully placed prompts to cover
a broad range of question types and difficulty levels in a balanced manner. A
proposed job aid assisted teachers in evaluating and selecting specific prompts
that generate aligned English reading questions. Additionally, we conducted two
rounds of validations to improve and refine the prompts and a protocol for apply-
ing them, ensuring the quality of the generated questions. Iterative validations
with NLP and English education experts helped identify key design issues (chal-
lenges and areas for improvement) in the proposed architecture for AQG imple-
mentation by teachers. We found that providing a two-dimensional matrix with
specific prompts for selecting appropriate prompts for reading materials and
learning objectives is beneficial. Furthermore, we confirm that offering instruc-
tional material guiding teachers through the prompts is critical to engaging teach-
ers more effectively in AQG technology integration in their classroom assess-
ment. Specifically, this approach can assist teachers in identifying their students’
needs and tailoring their teaching strategies accordingly.


6.1.2 RQ2. To what extent can a prompt engineering protocol with generated English problems yield high validity and reliability?

The validation findings demonstrated the validity and reliability of the new
prompt-based AQG approach using ChatGPT. During the teacher validation, the
overall mean score (3.35) and average CVI score (0.89) indicate that the majority
of generated questions measured what they were intended to assess. The IRA score of 0.76
indicates that the results are fair and reliable. This finding demonstrates that the
proposed approach was effective in enhancing the validity of AQG implementa-
tion in English reading problems (Kurdi et al., 2020).
Furthermore, the results revealed AQG’s weaknesses in reading comprehension
for certain types of questions, such as ’yes–no questions with multiple choices’ and
’cloze’ questions. The lower CVI scores for these question types suggest that they
may be inadequate at measuring what they are intended to assess. According to the
teachers’ feedback, the complex sentence structures used in the ChatGPT’s ’yes–no’
questions could be distracting, resulting in decreased validity. Similarly, inconsist-
ency in applying rules for ’cloze’ questions resulted in low ratings for this question
type.
Teachers identified challenges with the generation of yes–no questions and clozes.
The main issue with yes–no questions was their complexity, making them hard for
students to understand. Because of this, students often misinterpreted the questions,
leading to errors not related to content comprehension. ChatGPT, as identified in
several studies (Bang et al., 2023; Alkaissi & McFarlane, 2023; Rudolph et al.,
2023; Ali et al., 2022), can produce a ’hallucination’ effect or plausible but incorrect
answers, particularly for complex tasks. OpenAI’s (2023) own findings corroborated
this observation. It documented the performance of ChatGPT across various tasks,
observing its struggles with tasks that are also challenging for humans.
For cloze items, the problem was not linguistic complexity but ChatGPT’s limita-
tions. Despite giving detailed instructions, the blanks in cloze questions were often
misplaced, requiring teachers’ extra adjustments. To address this issue,
more advanced language models could finely adjust the difficulty level of question
sentence structures, and more delicate prompting techniques could more accurately target
the intended goals.
Open-ended questions, particularly those seeking ’yes–no’ responses or infer-
ential content, received lower scores. Feedback showed these questions often
focused on less relevant text parts and had unclear structures, mirroring the issues
with yes–no and cloze questions. The root of these challenges is ChatGPT’s lack
of understanding of a passage’s context or the objectives behind a question. For
open-ended questions, an issue arises when the question should emphasize the text’s
purpose. Instead of accurately pinpointing the main intent, ChatGPT might high-
light irrelevant details, asking if they represent the passage’s goal. While students
might answer “no” correctly, it does not provide the reflective opportunity teachers
seek. Ideally, questions should challenge students with plausible options, not clear-
cut wrong ones. Another challenge is with sentence insertion questions. Teachers
generally craft these questions by segmenting sentences and designing them around
critical linguistic elements like conjunctions or transitional phrases, aiding students


in understanding the flow of texts. However, ChatGPT might randomly choose less
relevant sections for the segmentation.
To address this issue, we either need educators to refine ChatGPT’s outputs or
develop more sophisticated prompting techniques tailored to educational goals. Edu-
cators well-versed in such instructions might need to fine-tune the results. We believe that more detailed prompts could resolve the targeting issues we identified. As of now, our prompts are concise, often two sentences, designed for teachers' ease of use. However, a multi-step approach could work better because it breaks down the complex task into simpler and more clearly defined steps. We believe that this approach could be used to craft open-ended and yes–no questions in a more suitable way.
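As an illustration of this multi-step idea, the sketch below chains two ChatGPT calls: the first extracts the passage's main claim, and the second writes a plainly worded yes–no question targeting that claim. The decomposition, prompt wording, model name, and use of the openai Python client (chat completions, openai >= 1.0) are our assumptions for illustration rather than the validated protocol, which teachers used through the ChatGPT interface.

# Hedged sketch of a two-step (chained) prompting approach for yes-no items.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(prompt: str) -> str:
    """Send one user message and return the model's reply text."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def generate_yes_no_question(passage: str) -> str:
    # Step 1: narrow the task to the passage's main claim.
    claim = ask(f"{passage}\n\nState the main claim of this passage in one short sentence.")
    # Step 2: write a simply worded yes-no question about that claim.
    return ask(
        f"Passage claim: {claim}\n\n"
        "Write one yes-no reading comprehension question about this claim. "
        "Use simple sentence structure suitable for EFL learners and end with "
        "'Yes or no:' followed by the answer."
    )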
Despite the aforementioned technical constraints, overall, the matrix-driven,
prompt-based AQG method in the current study showcased commendable validity
and reliability. This outcome highlights the method’s efficacy in producing ques-
tions, underscoring its utility for teachers seeking practical applications of AQG in
real-world educational scenarios.

6.2 Implications

6.2.1 Academic implications

This study offers a fresh perspective to the ongoing discussion surrounding the inte-
gration of AQG as an AI tool in educational contexts. Whereas most prior research
has primarily focused on the technical enhancements of AQG, this study broadens the discussion by presenting empirical design insights pertaining to the usefulness of AQG technology. It encompasses a specific prompt design, a protocol, and a 2D matrix. Notably, certain question types did not fare well with ChatGPT-driven language generation, leading to lower validity scores. Specifically, the cloze and yes–no
questions with multiple option categories presented challenges due to misplaced or
overly complex designs. In addition, a range of open-ended questions, which require
inferential content or sentence insertion options, garnered suboptimal CVI scores.
Based on the study findings, we advocate for the pairing of specific question types
with an AQG protocol to enhance teachers’ technology adoption.
Fundamentally, the outcomes of this study highlight the potential of synergistic
human-technology (AI) collaborations in further education studies. While Chat-
GPT-powered AQG offers a glimpse into teacher perceptions about AI-human col-
laborations in crafting quality English questions, teachers still need to maintain their
pivotal role in evaluating and refining AI-generated content. Performing prompt engineering for English question creation can equip educators with innovative assessment tools and pedagogical strategies. The primary aim of English
teachers involves appraising the acquisition of the four language skills by learners.
To this end, teachers routinely craft quizzes and comprehension tasks to promote
student participation and gauge their comprehension level as formative assess-
ments. Formulating intricate questions, particularly for advanced students, can be a
demanding endeavor.


Furthermore, this paradigm shift can engender novel pedagogical methods. Tra-
ditionally, teachers bore the brunt of crafting quizzes, a task amplified in complexity
for higher grade levels. However, with tools that facilitate question generation, this
mantle can be passed to students. For example, it enables activities wherein students
create their comprehension questions, promoting peer-based evaluations. While cer-
tain progressive teaching methods, such as Havruta, have already integrated student-
led questioning to validate mutual understanding, the proficiency to craft English
questions remains scarce among Korean higher-grade students. Consequently, these
tools can reduce the hurdles of question development, allowing students to focus their efforts on core tasks such as explanation and comprehension.
We propose seeing AQG with ChatGPT as a collaboration between teachers and
AI. Here, the knowledge of teachers enhances the AI’s question-generation ability,
aiming for better English learning outcomes. This point of view prioritizes iterative
and reciprocal human-AI interactions (Ouyang & Jiao, 2021) in order to enhance
students' learning outcomes synergistically. It also opens a discussion of how distributed cognition takes place when teachers engage with ChatGPT during human-AI collaborations (Kim et al., 2022; Ouyang & Jiao, 2021; Price & Flach, 2017; Xue & Wang, 2022).

6.2.2 Design implications

This study underscores the need for thorough validation of AI-crafted question
types to ensure their performance in assessing learning outcomes. Using the DDR
approach (Richey & Klein, 2014), we noted improvements in the validity and reli-
ability of auto-generated questions after two validation rounds. This shows that
integrating validation processes can refine AQG outputs, making them valuable
for classrooms. Thus, it is crucial to prioritize validation in future AQG research to
boost question quality.
Our investigation revealed that many teachers require more guidance or manu-
als to fully utilize the protocol’s benefits. This mirrors previous studies (Aiken &
Epstein, 2000; Hwang et al., 2020; Okonkwo & Ade-Ibijola, 2021) that highlight the
importance of AI tool guidelines for successful technology adoption. Many teachers
are still developing a comprehensive and concrete understanding of how to use generative AI, such as ChatGPT, for teaching purposes. Hence, providing clear and practical directions
on using the protocol for high-quality English questions is essential. One avenue to
explore is offering AI literacy training (i.e., professional development activities) to
teachers (García-Peñalvo, 2023; Zhai, 2022). For effective use of conversational AI
tools, like ChatGPT, teachers should grasp AI’s basics and its practical applications.
Such training can equip them with fundamental AI knowledge, enabling confident
use of the AQG protocol.
Though our protocol was designed for English reading questions, it is adaptable
for various text types. We have included an appendix in this paper, showcasing the
matrix table, a sample passage, and the generated questions. This resource offers
readers a deep dive into the AQG process using the matrix and ChatGPT. The pro-
tocol’s detailed method and supplementary materials shown in this paper can aid

researchers and teachers in crafting high-quality English reading questions in an efficient manner.

6.2.3 Contribution to AQG model development

Prior AQG research has explored different techniques, classifying them mainly as single-hop QG, in which questions are framed from a single sentence, and multi-hop QG, in which questions amalgamate information from several sentences (Li et al., 2023). However, these studies exhibit limitations. The majority of questions formulated are Wh- questions, with answers often limited to the original paragraph's lexicon. In addition, the question-generation procedure is mostly automated, offering minimal manual control.
This study distinguishes itself by harnessing ChatGPT prompts in the AQG
model, introducing a more varied and adaptable method. Utilizing ChatGPT
prompts in AQG facilitates the creation of diverse question types and allows the
designation of specific answer forms based on English taxonomy. This augmented
flexibility enables researchers to have greater control over the question-generation
process, resulting in more tailored and versatile questions that align with the specific
needs of the study.

6.3 Limitations and future research

The current study acknowledges several areas that need further exploration. First, we relied on a select group of experts and teachers for validation, which restricts the generalizability of the findings.
Furthermore, the proposed approach has not been tested in real classrooms. Under-
standing how AQG integrates into educational routines and its effectiveness with
students in authentic settings is crucial. Future research should diversify its partici-
pants and expand sample sizes to ensure our AQG protocols and prompts’ wider
applicability. Second, our primary focus was utilizing GPT-3 and ChatGPT, exclud-
ing other options. Exploring the potential of other AQG technologies, from rule-
based systems to machine learning algorithms, could be beneficial. Further studies
may also assess different metrics to estimate AI-generated question quality, offer-
ing design insights into the effects of varied design choices on AQG performance.
Third, we aimed to evaluate an AQG protocol for English reading questions, yet a
comprehensive exploration of its adaptability across different learning environments
and subjects remains to be conducted. Future research could assess the protocol's suitability across
diverse educational scenarios, such as grade levels or curriculum demands. Given
AQG’s scalability, exploring the protocol’s adaptability in varied learning contexts
is crucial (Goel, 2020; Kim et al., 2022). Finally, a deeper exploration of teachers’
views on integrating AQG technology is warranted. Detailed qualitative analyses in
future studies can provide insights into AQG’s classroom adoption from educators’
perspectives.


7 Conclusion

We illustrated the potential of AI-human collaboration in AQG using ChatGPT, by testing its performance on question types and refining generated questions through
multiple rounds of validations. The findings highlight the effectiveness of the pro-
posed AQG protocol in generating high-quality English reading questions. This
method can be effectively integrated into a variety of educational contexts by pro-
viding educators with practical guidance and AI literacy training. Furthermore, our
findings encourage further investigation into the protocol’s adaptability, alternative
AQG technologies, and teachers’ attitudes and challenges when adopting AQG tech-
nology, leading to more effective and efficient question-generation systems in education.

Appendix 1

This appendix provides supplementary information that is not essential to the main text but may help readers gain a more comprehensive understanding of the research, or that would be too cumbersome to include in the body of the paper. Therefore, we provide a link in the footnote to the full manual.2

Introduction to ChatGPT

ChatGPT is a powerful tool developed by OpenAI that can understand and generate human-like responses in conversations. It can help generate questions for different purposes, including educational assessments. ChatGPT works by learning from a large amount of text data, which helps it understand grammar, vocabulary, and the meaning of words in different contexts. This learning process is analogous to training ChatGPT's 'brain' to understand language.
Once ChatGPT has learned from the training data, it can be fine-tuned for spe-
cific tasks like generating questions. During the fine-tuning process, it learns to gen-
erate questions that are relevant and appropriate based on a given passage or topic.
To use ChatGPT for question generation, you provide it with a passage of text,
and it uses its knowledge and understanding of language to create questions that test
the reader’s comprehension of the material. It tries to generate questions that make
sense based on the information in the passage.
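To ground this description, the short sketch below drafts several question types from a single passage in one loop; a teacher would then review and refine the drafts. The prompt wordings paraphrase the examples later in this appendix, and the model name and openai client usage are illustrative assumptions (in practice, the same messages can simply be pasted into the ChatGPT interface).

# Hedged sketch: draft a small set of questions from one passage, for the
# teacher to review afterwards. Prompts paraphrase the appendix examples.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "A1 (yes-no)": "Make a yes-no question about information explicitly "
                   "stated in the text, followed by 'Yes or no:' and the answer.",
    "D5 (open-ended)": "Make an open-ended 'wh' question about the main "
                       "idea of the text.",
}

def draft_questions(passage: str) -> dict[str, str]:
    drafts = {}
    for label, instruction in PROMPTS.items():
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"{passage}\n\n{instruction}"}],
        )
        drafts[label] = response.choices[0].message.content.strip()
    return drafts  # teacher validation of these drafts remains essential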
However, it’s important to know that ChatGPT has some limitations. Sometimes it
may generate incorrect or nonsensical responses, and it can have difficulty understanding
complex or ambiguous queries. Also, it relies on patterns it has learned from the training
data and may generate responses that sound plausible but are actually incorrect.

2 This link connects to the AQG manual used in this research: https://docs.google.com/document/d/1h23DtAVeKHd1AiUvTlpVN3AG073xRg4vygpu4-o82-s/edit?usp=sharing


Despite these limitations, ChatGPT has great potential as a tool for generating questions. When its responses are carefully reviewed and validated, it can assist in creating contextually appropriate and varied questions, making assessments more effective and efficient.
In summary, ChatGPT is an exciting tool that can understand and generate
human-like responses. It can be used to generate questions that test understanding,
but it’s important to be aware of its limitations and use it wisely in order to benefit
from its capabilities in question generation.

Example passage

The most plausible explanation for the science communication problem is the
public’s limited capacity to comprehend science. The public is only modestly
science literate. About half, we are regularly reminded, understand that the
earth orbits the sun in a year as opposed to a day; less than a quarter knows that
nitrogen is the most common gas in the earth's atmosphere; less than 10% can
make sense of a two-by-two contingency table essential to determining the ratio
of true to false positives when assessing medical test results. So how can mem-
bers of the public possibly be expected to understand what scientists are saying
when scientists try to explain complex issues like climate change or nuclear
power? More importantly, members of the public do not think the way scien-
tists do. They rely on rapid, intuitive, affect-driven sources of information pro-
cessing to the exclusion of the deliberate, conscious, analytic ones essential to
making appropriate judgments of risk. As a result, they tend to overestimate the
magnitude of more emotionally charged disasters (e.g., terrorist attacks) and
discount more consequential but more temporally or emotionally remote ones
(e.g., the impact of human-caused climate change). They also are more likely to
rely on defective heuristics, such as crediting the opinions of their peers, a form
of reasoning that can trigger self-reinforcing states of polarization.

Example output

We also provide a link in the footnote to the full example outputs.3

3 This link connects to the AQG example outputs in this research: https://docs.google.com/document/d/1h23DtAVeKHd1AiUvTlpVN3AG073xRg4vygpu4-o82-s/edit?usp=sharing


Example 1

A1 Identify / y-n, alternative, t-f / multiple-choice


Prompt: Passage / make a yes–no question of identifying information that is explicitly shown in the text. After questions, give 'yes or no' choice option
Question: Is the public's reliance on rapid, intuitive, affect-driven sources of information processing a result of their limited capacity to comprehend science? Yes or no: No

Example 2

A1 Identify / y-n, alternative, t-f / multiple-choice


Prompt: Passage / Make a multiple choice, 'wh' question asking you to choose what is the topic of the passage
Question: What is the main topic discussed in the passage?
A) The public’s limited capacity to comprehend science
B) The Earth’s orbit around the Sun
C) Nitrogen in the atmosphere
D) Two-by-two contingency tables for medical testing
E) Heuristics and polarization in the public’s thinking about science-related issues

Example 3

E4 Inference / cloze / multiple-choice


Prompt: Passage / Make a cloze test with multiple-choice at the sentence that contains information that can be implicitly inferred from the text. Rewrite the whole text with the cloze. Put the options at the end of the text
Question: Please fill in the blank with the correct answer:


The most plausible explanation for the science communication problem is the public's limited capacity to comprehend science. The public is only modestly science literate. About half, we are regularly reminded, understands that the earth orbits the sun in a year as opposed to a day; less than a quarter knows that nitrogen is the most common gas in the earth's atmosphere; less than 10 percent can make sense of a two-by-two contingency table essential to determining the ratio of true to false positives when assessing medical test results. So how can members of the public possibly be expected to understand what scientists are saying when scientists try to explain complex issues like climate change or nuclear power? More importantly, members of the public do not think the way scientists do. () sources of information processing to the exclusion of the deliberate, conscious, analytic ones essential to making appropriate judgments of risk. As a result, they tend to overestimate the magnitude of more emotionally charged disasters (e.g., terrorist attacks) and discount more consequential but more temporally or emotionally remote ones (e.g., the impact of human-caused climate change). They also are more likely to rely on defective heuristics, such as crediting the opinions of their peers, a form of reasoning that can trigger self-reinforcing states of polarization
A) They rely on rapid, intuitive, affect-driven
B) They rely on slow, logical, affect-neutral
C) They rely on a combination of rapid, intuitive, affect-driven and deliberate, conscious, analytic
D) They rely solely on deliberate, conscious, analytic

Example 4

D5 Sentiment / wh-q / open-ended


Prompt: Passage / Make an open-ended question that asks the main characters' sentiments in the P1, using 'wh' question form
Question: What were Claire and Tony's feelings towards the situation with the oil-covered bird on the beach and the boys trying to feed it?

Example 5

C8 Irrelevant Sentence / wh-q / multiple-choice


Prompt: Passage / Put an irrelevant sentence somewhere in the text. Rewrite the text including the irrelevant sentence. Make a test question that requires me to choose an irrelevant sentence from the text
Question: The most plausible explanation for the science communication problem is the public's limited capacity to comprehend science. The public is only modestly science literate. About half, we are regularly reminded, understands that the earth orbits the sun in a year as opposed to a day; less than a quarter knows that nitrogen is the most common gas in the earth's atmosphere; less than 10 percent can make sense of a two-by-two contingency table essential to determining the ratio of true to false positives when assessing medical test results. The moon is made of cheese. So how can members of the public possibly be expected to understand what scientists are saying when scientists try to explain complex issues like climate change or nuclear power? More importantly, members of the public do not think the way scientists do. They rely on rapid, intuitive, affect-driven sources of information processing to the exclusion of the deliberate, conscious, analytic ones essential to making appropriate judgments of risk. As a result, they tend to overestimate the magnitude of more emotionally charged disasters (e.g., terrorist attacks) and discount more consequential but more temporally or emotionally remote ones (e.g., the impact of human-caused climate change). They also are more likely to rely on defective heuristics, such as crediting the opinions of their peers, a form of reasoning that can trigger self-reinforcing states of polarization
Test question:
Which sentence in the passage is irrelevant to the topic of science communication?
A) "About half, we are regularly reminded, understands that the earth orbits the sun in a year as opposed to a day"
B) "The moon is made of cheese"
C) "More important still, members of the public do not think the way scientists do."
D) "They also are more likely to rely on defective heuristics, such as crediting the opinions of their peers"

The teacher validation results of each prompt

Item  Mean  Variation  CVI  IRA (each item)  IRA (overall)
A1 3.50 0.29 1.00 1 0.76


B1 3.38 0.55 0.88 0.88
C1 3.88 0.13 1.00 1
D1 3.38 0.27 1.00 1
E1 3.63 0.27 1.00 1
F1 3.25 0.50 0.88 0.88
A2 2.88 0.41 0.75 0.75
B2 3.25 0.21 1.00 1
C2 3.50 0.29 1.00 1
D2 3.25 0.21 1.00 1
E2 3.38 0.27 1.00 1
F2 3.00 0.57 0.75 0.75
A3 3.13 0.70 0.75 0.75
B3 3.25 0.79 0.75 0.75
C3 3.50 0.29 1.00 1
D3 3.13 0.41 0.88 0.88
A4 2.75 0.79 0.50 0.5
B4 3.25 0.50 0.88 0.88
C4 3.75 0.21 1.00 1
D4 3.38 0.55 0.88 0.88
E4 3.38 0.55 0.88 0.88
F4 3.00 1.43 0.63 0.63
A5 3.25 1.07 0.88 0.88
B5 3.63 0.27 1.00 1
C5 3.63 0.27 1.00 1
D5 3.63 0.27 1.00 1
E5 3.88 0.13 1.00 1
F5 3.25 0.50 0.88 0.88
A6 3.25 0.50 0.88 0.88
B6 3.38 0.55 0.88 0.88
C6 3.50 0.29 1.00 1
D6 3.50 0.29 1.00 1
E6 3.50 0.29 1.00 1
F6 3.13 1.27 0.75 0.75
C8 3.50 0.57 0.88 0.88
C9 3.00 1.14 0.75 0.75
C10 3.38 0.55 0.88 0.88
Average  3.35  -  0.89

Data Availability The data supporting the findings of this study are available, but restrictions apply because they include personal interviews and sensitive material. As a result, the data cannot be made publicly available. However, the authors will evaluate data access requests individually, taking into account the reasonableness of the request and whether proper permissions have been obtained.


Declarations
Conflict of Interest None.

Acknowledgements This work was supported by Institute of Information & Communications Technol-
ogy Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A
Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques).

References
Aiken, R. M., & Epstein, R. G. (2000). Ethical guidelines for AI in education: Starting a conversation.
International Journal of Artificial Intelligence in Education, 11(2), 163–176.
Alsubait, T., Parsia, B., & Sattler, U. (2016). Ontology-based multiple choice question generation. KI-
Künstliche Intelligenz, 30, 183–188. https://​doi.​org/​10.​1155/​2014/​274949
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., ... & Fung, P. (2023). A multitask, multi-
lingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv pre-
print arXiv:​2302.​04023. https://​doi.​org/​10.​48550/​arXiv.​2302.​04023
Ben Abacha, A., Dos Reis, J. C., Mrabet, Y., Pruski, C., & Da Silveira, M. (2016). Towards natural lan-
guage question generation for the validation of ontologies and mappings. Journal of Biomedical
Semantics, 7, 1–15. https://​doi.​org/​10.​1186/​s13326-​016-​0089-6
Brown, H. D., & Abeywickrama, P. (2004). Language assessment. Principles and Classroom Practices.
Pearson Education.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P.,
Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh,
A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners.
Advances in neural information processing systems, 33, 1877–1901. https://​doi.​org/​10.​48550/​arXiv.​
2005.​14165
Cao, T., Zeng, S., Xu, X., Mansur, M., & Chang, B. (2022). DISK: Domain-constrained Instance Sketch
for Math Word Problem Generation. arXiv preprint arXiv:​2204.​04686. https://​doi.​org/​10.​48550/​
arXiv.​2204.​04686
Das, B., Majumder, M., Phadikar, S., & Sekh, A. A. (2021). Automatic question generation and answer
assessment: A survey. Research and Practice in Technology Enhanced Learning, 16(1), 1–15.
https://​doi.​org/​10.​1186/​s41039-​021-​00151-1
Day, R. R., & Park, J. S. (2005). Developing Reading Comprehension Questions. Reading in a Foreign
Language, 17(1), 60–73.
Duke, N. K., & Pearson, P. D. (2009). Effective practices for developing reading comprehension. Journal
of Education, 189(1–2), 107–122. https://​doi.​org/​10.​1598/​08720​71774.​10
Floridi, L., & Chiriatti, M. (2020). GPT-3: Its nature, scope, limits, and consequences. Minds and
Machines, 30, 681–694. https://​doi.​org/​10.​1007/​s11023-​020-​09548-1
Gao, Y., Wang, J., Bing, L., King, I., & Lyu, M. R. (2018). Difficulty controllable question generation for
reading comprehension. arXiv preprint arXiv:​1807.​03586. https://​doi.​org/​10.​48550/​arXiv.​1807.​03586
García-Peñalvo, F. J. (2023). The perception of Artificial Intelligence in educational contexts after the
launch of ChatGPT: Disruption or Panic?. Education in the Knowledge Society, 24. https://​doi.​org/​
10.​14201/​eks.​31279
Goel, A. (2020). Ai-powered learning: making education accessible, affordable, and achievable. arXiv
preprint arXiv:​2006.​01908. https://​doi.​org/​10.​48550/​arXiv.​2006.​01908
Hwang, G. J., Xie, H., Wah, B. W., & Gašević, D. (2020). Vision, challenges, roles and research issues
of Artificial Intelligence in Education. Computers and Education: Artificial Intelligence, 1, 100001.
Kim, Y. M., & Kang, M. K. (2012). The external analysis of the validation on item-types of foreign lan-
guage (English) domain of CSAT. Modern English Education, 13(4), 239–270.
Kim, J., Lee, H., & Cho, Y. H. (2022). Learning design to support student-AI collaboration: Perspectives
of leading teachers for AI in education. Education and Information Technologies, 27(5), 6069–6104.
https://​doi.​org/​10.​1007/​s10639-​021-​10831-6


Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic ques-
tion generation for educational purposes. International Journal of Artificial Intelligence in Educa-
tion, 30, 121–204. https://​doi.​org/​10.​1007/​s40593-​019-​00186-y
Larrañaga, M., Aldabe, I., Arruarte, A., Elorriaga, J. A., & Maritxalar, M. (2022). A Qualitative Case
Study on the Validation of Automatically Generated Multiple-Choice Questions From Science Text-
books. IEEE Transactions on Learning Technologies, 15(3), 338–349. https://​doi.​org/​10.​1109/​TLT.​
2022.​31715​89
Li, Z., Cao, Z., Li, P., Zhong, Y., & Li, S. (2023). Multi-Hop Question Generation with Knowledge
Graph-Enhanced Language Model. Applied Sciences, 13(9), 5765. https://​doi.​org/​10.​3390/​app13​
095765
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A
systematic survey of prompting methods in natural language processing. ACM Computing Surveys,
55(9), 1–35. https://​doi.​org/​10.​48550/​arXiv.​2107.​13586
Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research, 35(6),
382–386.
Morón, M., Scocozza, J., Chiruzzo, L., & Rosá, A. (2021, November). A tool for automatic question
generation for teaching English to beginner students. In 2021 40th International Conference of the
Chilean Computer Science Society, 1–5. IEEE. https://​doi.​org/​10.​1109/​SCCC5​4552.​2021.​96504​23
Okonkwo, C. W., & Ade-Ibijola, A. (2021). Chatbots applications in education: A systematic review.
Computers and Education: Artificial Intelligence, 2, 100033. https://​doi.​org/​10.​1016/j.​caeai.​2021.​
100033
OpenAI. (2023). GPT-4 Technical Report. arXiv:​2303.​08774. https://​doi.​org/​10.​48550/​arXiv.​2303.​08774
Oppenlaender, J. (2022). A Taxonomy of Prompt Modifiers for Text-To-Image Generation. arXiv preprint
arXiv:​2204.​13988. https://​doi.​org/​10.​48550/​arXiv.​2204.​13988
Ouyang, F., & Jiao, P. (2021). Artificial intelligence in education: The three paradigms. Computers and
Education: Artificial Intelligence, 2, 100020. https://​doi.​org/​10.​1016/j.​caeai.​2021.​100020
Pan, L., Lei, W., Chua, T. S., & Kan, M. Y. (2019). Recent advances in neural question generation. arXiv
preprint arXiv:​1905.​08949. https://​doi.​org/​10.​48550/​arXiv.​1905.​08949
Price, S., & Flach, P. A. (2017). Computational support for academic peer review: A perspective from
artificial intelligence. Communications of the ACM, 60(3), 70–79. https://​doi.​org/​10.​1145/​29796​72
Richey, R. C., & Klein, J. D. (2005). Developmental research methods: Creating knowledge from instruc-
tional design and development practice. Journal of Computing in Higher Education, 16, 23–38.
https://​doi.​org/​10.​1007/​BF029​61473
Richey, R. C., & Klein, J. D. (2014). Design and development research: Methods, strategies, and issues.
Routledge. https://​doi.​org/​10.​4324/​97802​03826​034
Rubio, D. M., Berg-Weger, M., Tebb, S. S., Lee, E. S., & Rauch, S. (2003). Objectifying content validity:
Conducting a content validity study in social work research. Social Work Research, 27(2), 94–104.
https://​doi.​org/​10.​1093/​swr/​27.2.​94
Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer or the end of traditional assessments
in higher education?. Journal of Applied Learning and Teaching, 6(1). https://​doi.​org/​10.​37074/​jalt.​
2023.6.​1.9
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., & Singh, S. (2020). Autoprompt: Eliciting knowledge
from language models with automatically generated prompts. arXiv preprint arXiv:​2010.​15980.
https://​doi.​org/​10.​48550/​arXiv.​2010.​15980
Shohamy, E. (1984). Does the testing method make a difference? The case of reading comprehension.
Language Testing, 1(2), 147–170. https://​doi.​org/​10.​1177/​02655​32284​00100​203
Soonklang, T., & Muangon, W. (2017). Automatic question generation system for English exercise for
secondary students. In the 25th international conference on Computers in education.
Van Campenhout, R., Dittel, J. S., Jerome, B., & Johnson, B. G. (2021). Transforming Textbooks into
Learning by Doing Environments: An Evaluation of Textbook-Based Automatic Question Genera-
tion 60-73. In iTextbooks@ AIED.
Xue, Y., & Wang, Y. (2022). Artificial intelligence for education and teaching. Wireless Communications
and Mobile Computing, 1–10. https://​doi.​org/​10.​1155/​2022/​47500​18
Zhai, X. (2022). ChatGPT user experience: Implications for education. (December 27, 2022). Available
at SSRN: https://​ssrn.​com/​abstr​act=​43124​18 or https://​doi.​org/​10.​2139/​ssrn.​43124​18
Zhang, N., Li, L., Chen, X., Deng, S., Bi, Z., Tan, C., ... & Chen, H. (2021). Differentiable prompt makes
pre-trained language models better few-shot learners. arXiv preprint arXiv:​2108.​13161. https://​doi.​
org/​10.​48550/​arXiv.​2108.​13161


Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.

Authors and Affiliations

Unggi Lee1 · Haewon Jung2 · Younghoon Jeon3 · Younghoon Sohn1 · Wonhee Hwang4 · Jewoong Moon5 · Hyeoncheol Kim1

* Unggi Lee
codingchild@korea.ac.kr
* Hyeoncheol Kim
hkim64@gmail.com
Haewon Jung
hwjunghw@snu.ac.kr
Younghoon Jeon
yesica@upstage.ai
Younghoon Sohn
wi619nd@naver.com
Wonhee Hwang
gml2416@sen.go.kr
Jewoong Moon
jmoon19@ua.edu
1 Department of Computer Science and Engineering, Korea University, Seoul, South Korea
2 Bongdam Highschool, Gyeonggi, South Korea
3 Upstage AI Education & Content, Upstage, Yongin-Si, South Korea
4 Seoul Jayang Elementary School, Seoul, Republic of Korea
5 Department of Educational Leadership, Policy, and Technology Studies (ELPTS), The University of Alabama, Tuscaloosa, AL, USA
