
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TLT.2016.2612659, IEEE Transactions on Learning Technologies.

Automated Essay Feedback Generation and Its Impact in the Revision

Ming Liu, Yi Li, Weiwei Xu and Li Liu

• M. Liu is with the School of Computer and Information Science, Southwest University, Chongqing 400715, China. E-mail: mingliu@swu.edu.cn.
• Y. Li is with the Faculty of Education, Southwest University, Chongqing 400715, China. E-mail: liyi1807@swu.edu.cn.
• W.W. Xu is with the College of International Studies, Southwest University, Chongqing 400715, China. E-mail: 534514401@qq.com.
• L. Liu is with the School of Software Engineering, Chongqing University, Chongqing 400044, P.R. China. E-mail: dcsliuli@cqu.edu.cn.

Abstract—Writing an essay is a very important skill for students to master, but a difficult one to acquire. This is particularly true for English as Second Language (ESL) students in China. It would be very useful if students could receive timely and effective feedback about their writing. Automatic essay feedback generation is a challenging task, which requires understanding the relationship between the text features of an essay and the feedback it receives. In this study, we first analyzed 1290 teacher comments on 327 essays written by English-major students and annotated the feedback on seven aspects of writing for each paper: grammar, spelling, sentence diversity, organization, supporting ideas, coherence and conclusion. Then, an automatic feedback classification experiment was conducted with a machine learning approach. Finally, we investigated the impact of system-generated indirect corrective feedback (ICF) and human teachers' direct corrective feedback (DCF) in two English writing classes (N=56 in the ICF class; N=54 in the DCF class) at a key Chinese university through a web-based assignment management system. The results indicate the feasibility of this approach: system-generated ICF can be as useful as direct comments made by teachers in terms of improving the quality of the content regarding structure, organization, supporting ideas, coherence and conclusion, and in encouraging students to spend more time on self-correction.

Index Terms—Writing Feedback, Text Analysis, Natural Language Processing

——————————u——————————

1 INTRODUCTION

With the coming of the 21st century and the globalization of English, English essay writing, as one of the four basic skills of language learning, has become a more and more important skill. It not only requires some basic writing skills, such as spelling and grammar, but also demands higher-level writing competencies, such as coherence, structure and reasoning. Thus, it is a difficult task to master, particularly for students in China. Statistics show that the number of college students in China soared to twenty-six million in 2013 [1], accounting for the largest proportion of English as Second Language (ESL) learners worldwide. Since 1987, the writing test has been an important component of college English testing in China. For college students in China, college English is an obligatory course, and a fair score on the College English Test is required of all Chinese students graduating from any university. In a typical English course, students have to complete two or three essay writing assignments and take an essay writing test in order to pass national English tests, such as College English Test (CET) 4 or Test for English Majors (TEM) 4. Essay writing is the last part of these tests. In EFL (English as Foreign Language) teaching practice in China, where large classes are the norm, an enormous amount of information has to be dealt with and learning is largely exam oriented, so getting timely feedback on each EFL learner's writing task is often difficult.

Since the early 1980s, researchers have investigated the effectiveness of teacher feedback as a way of improving students' writing [2]. A substantial amount of research on teacher written feedback in ESL writing contexts has been concerned with the benefits of corrective feedback for students' writing development [3]. Corrective feedback is a commonly used feedback type in classrooms: the marking of a student's error by the teacher. Fathman and Whalley [4] found positive effects of corrective feedback on rewriting, for both grammar and content. However, trying to establish a direct link between corrective feedback and successful second language acquisition is oversimplistic and highly problematic [5].

An increasing number of studies have been conducted to see whether certain types of written feedback, such as indirect corrective feedback (ICF) and direct corrective feedback (DCF), are more likely than others to help ESL students improve the accuracy of their writing [6]–[9]. DCF, or explicit feedback, occurs when the teacher identifies an error and provides the correct form or explicit suggestions to fix the problem, while ICF, or implicit feedback, refers to situations when the teacher indicates that an error has been made but does not provide a correction, thereby leaving the student to diagnose and correct it.

Studies examining the effects of these different types of error feedback on students' second language (L2) writing have reported positive impacts of ICF on students' ability to edit their own compositions and to improve levels of accuracy in writing, because ICF leads to reflection on writing and greater cognitive engagement [10]–[12]. Indeed, reflection is an important language learning step [13]. ICF encourages students to critically evaluate their own written performance in the target language with the goal of improving not only their linguistic competence and skill, but also their ability to learn [10], [14], [15].

With the advance of natural language processing techniques and statistical models, several commercial automated essay scoring systems were developed, such as e-rater developed by Educational Testing Service (ETS) [16], Knowledge Analysis Technologies and IntelliMetric [17], to analyze a wide range of text features at lexical, syntactic, semantic, and discourse levels. Based on these scoring systems, some automated writing evaluation (AWE) tools, such as Criterion [18] and MYAccess, have been developed to provide corrective feedback or scores on various rhetorical (e.g., organization) and language-related dimensions (e.g., grammar and mechanics) [19]. Advantages of automated feedback are its anonymity, instantaneousness, and encouragement of repetitive improvement by giving students more practice in writing essays [20]. Some researchers have argued that AWE might have negative effects on students' writing behavior [21], since students focus on improving scores rather than on content development.

Few researchers [22]–[25] have attempted to generate ICF on content development. The generated feedback or questions were used to scaffold student reflection on different aspects of the writing content, such as cohesion and organization. For example, the Glosser system [22] used text mining algorithms, such as Latent Semantic Analysis [26], to provide content clues about issues related to coherence and topics, scaffolding students' reflection with a set of generic trigger questions.

The work on Glosser points in the direction that we have followed in this project. In our approach, the system first points out weaknesses in some aspects of the writing, such as organization and structure. Then, it provides trigger questions to support student self-reflection on those weaknesses. Lastly, students perform self-correction on the writing. The system-generated ICF suggests that students "double-check" their essay with the several trigger questions provided. Instead of revising only to correct errors, students try to reconsider and refine the whole text. The aim of this study is to explore the challenges of automated essay feedback generation and the effects of system-generated ICF during revision in the context of Chinese ESL college students' writing. To fulfill the above-mentioned aims, the following research questions were posed:
1. What are the frequent aspects of the essay commented on by college English teachers in the context of Chinese ESL learners?
2. Can these comments or feedback be automatically detected using a machine learning approach?
3. What is the impact of the system-generated ICF and human teachers' direct comments on the quality of writing?

The rest of this paper is organized as follows: Section 2 presents related work on automated essay feedback systems. Sections 3 and 4 describe our approach to automatic essay feedback generation and the system evaluation results. Section 5 presents the user study that investigates the impact of the two types of feedback, ICF and DCF, on the quality of writing and discusses the results. Finally, Section 6 concludes this paper.

2 SURVEY ON AUTOMATED ESSAY FEEDBACK

Automated feedback systems for writing support can be traced back to Automated Essay Scoring (AES), first developed in the 1950s. Essay assessment is a time-consuming and costly process. Sometimes, it leads to many inconsistencies in the grades given by different human raters [27]. The early AES systems were used to overcome time, cost and reliability issues in writing assessment [28]. Project Essay Grader (PEG) is one of the first AES systems; it used an essay's objective features, such as word count or spelling errors, and gave a score for the quality of each feature. The experimental results indicated that the system-predicted scores were comparable to those of human raters. However, PEG only focused on surface features and ignored the semantic aspects of essays, such as coherence [29]. With the advance of text mining techniques, AES systems can now provide scores as feedback on semantic aspects of writing, such as topic coverage, discourse structure and coherence [30]. Haswell [19] argued that these AES systems focused primarily on providing holistic grades and less on meaningful feedback on writing [31]. Ericsson and Haswell [32] also criticized these AES systems, as they deemed them focused mainly on providing feedback to improve grades. The authors claimed that such an approach devalued the role of teachers as well as warped students' notions of good writing. However, according to other researchers [18], these tools could motivate students to write and revise. Gibbs and Simpson [33] defined several characteristics of effective feedback, stating that it should be timely, specific enough, and focused on learning rather than marks. Despite a variety of initiatives to improve the quality of automatic feedback, the effectiveness of the proposed systems remains to be proven and further research is needed.

Because AES systems have been developed for assessment rather than to assist learning, many researchers [34] have tried to bring the focus back to learning (through automated feedback) instead of scores. They used technologies similar to AES systems to extract document features, trying to translate these features into useful information, typically related to common writing problems. One of the challenges of these approaches is to make the feedback specific, so that students can understand how to improve their writing.


Figure 1: The user interface of the topics tool in Glosser. The figure is reproduced from Calvo et al. [22]

Many automatic writing feedback systems have been designed to address specific writing problems. Some of the early systems, including Editor [35], developed at Rochester Institute of Technology, and Writers Workshop [36], developed at Bell Laboratories, focused on grammar and style checking. Research studies on the impact of Editor [37] concluded that the pedagogical benefits of grammar and style checking were limited. It could also be argued that these systems only focus on the final product.

Recently, many automatic writing feedback systems have started using text mining techniques to provide more sophisticated feedback. The Sourcer's Apprentice Intelligent Feedback system (SAIF) [38] also used text mining techniques to provide feedback to students writing essays. The system can detect plagiarism, uncited quotations, lack of citations, and limited content integration. Once a problem is detected, SAIF can give helpful feedback to the student, such as "Reword plagiarism and model proper format," if the problem is unsourced copied material (plagiarism). SAIF uses Latent Semantic Analysis (LSA) to calculate the average distance between consecutive sentences and provide feedback on the overall coherence of the text. LSA is a technique used to measure the semantic similarity of texts [26]. For finding citations, SAIF uses a regular expression pattern matching technique to detect explicit citations by recognizing phrases containing an author name (e.g. According to, As stated in, State). Evaluations have shown that SAIF provides feedback that encourages more explicit citations in students' essays. However, SAIF only addresses some basic problems related to sourcing and integration. In addition, it requires a large number of source documents to build the LSA semantic space and a large number of predefined pattern matching rules. Based on this technology, Kakkonen and Sutinen [39] proposed a model for the assessment of free text that combines both computerized and human models of assessment.

The most relevant work to the present study is Glosser, an automatic writing feedback system that provides academic essay writing support for college students [22], [34]. It uses text mining algorithms to analyze various features of texts, based on which feedback is provided to student writers. Glosser (1.0) provides feedback on some aspects of the writing, such as flow, topics, and topic map visualization. The feedback is given in the form of generic trigger questions (adapted to each course) and document features that relate to each set of questions. Figure 1 displays the user interface of the Topics feedback. The generic trigger questions (e.g. Are the ideas used in the essay relevant to the question? Are the ideas developed correctly?) are provided at the top of the page to help the writer focus their evaluation of the essay.

The extracted document feature, called a 'gloss', is shown below the questions. In this case, the gloss refers to the Topics on the left-hand side of the table shown in the figure, such as Global Language, Countries and Learn English; important sentences are listed on the right-hand side of the table. As Glosser highlights the 'gloss' in the essay, students learn during the process of reflection. However, Glosser used a set of generic questions to trigger reflection. Our previous approach used natural language processing techniques to generate content-specific trigger questions, based on citations provided in the student academic essays, to help reflection [23], [25].

Our current research can be considered an extension of the existing Glosser system. Like Glosser, we analyze student essays and automatically generate ICF addressing some aspects of writing. However, our approach focuses on more aspects of the writing, such as Grammar, Spelling, Sentence Diversity, Supporting Ideas and Organization, since these aspects were frequently addressed in the teachers' feedback in the context of Chinese ESL learners' writing, based on our empirical study findings. From the technology point of view, we adopted a supervised machine learning approach to classify the quality of essays with regard to each writing aspect. Specifically, the textual feature model was built using recent natural language processing techniques.

Recent developments in natural language processing have made it possible for researchers to develop a wide range of sophisticated techniques that facilitate text analysis. Some tools, such as Coh-Metrix [40], LIWC [41] and Gramulator [42], are useful in this respect, and have certainly contributed to ESL knowledge [43]. Coh-Metrix is a powerful computational tool that provides over 100 indices of cohesion, syntactical complexity, connectives and other descriptive information about content [40]. Coh-Metrix has been extensively used to analyze the overall quality of writing [43] and important aspects of writing quality, such as coherence [44]. In this study, we used Coh-Metrix to extract features to build the feedback classification model. The major contributions of this paper are the following:
1. We propose a novel approach to automatically generate essay feedback. Compared with the previous automated essay feedback system [22], our system applies a supervised machine learning approach to classify the quality of essays with regard to each writing aspect, and it focuses on more aspects of the writing, namely those frequently commented on by Chinese English teachers.
2. We conducted a quasi-experimental evaluation of automatic feedback technologies for writing, in the context of an English as a second language course in China. There are very few studies in which control and experimental groups are compared on their use of novel technologies [18]. In particular, our study examined the impact of the ICF on the content-related aspects of the essay.

3 DATA COLLECTION AND FEEDBACK ANNOTATION

The diagnostic assessment of writing is an important aspect of second language abilities testing, which focuses more on specific features than on global abilities [45]. Rating scales represent the construct on which the performance evaluation is based. North [46] reviewed several rating scales, including four skill models and a model of communicative language ability. Based on these findings and writing theories, Knoch [47] proposed a more comprehensive and practical model for assessing second language writing tests. He defined eight feature categories: accuracy, fluency, complexity, mechanics, cohesion, coherence, reader/writer interaction and content. The accuracy category contains the grammar feature, while the mechanics category contains the spelling feature. The complexity category contains the sentence diversity feature. The content category contains the supporting ideas and organization features. In this study, we adapted the Knoch model to annotate teachers' comments because it is more relevant to our case and covers a wider range of features than other models.

We investigated 1290 feedback comments written by 10 college English teachers on 327 essays written by second-year college students enrolled in the comprehensive English class from 2013 to 2015. All these students were English majors from the College of International Studies at Southwest University. The writing task was timed and counted as an assignment in the English class. Students were required to finish it within 40 minutes. The task was to write a persuasive essay following the standard of college English essay writing set by the Ministry of Education in China. The essay question was "Children Should Get Paid By Doing Housework".

The first task was to group the comments into frequent essay feature categories based on the Knoch model. Two experienced English teachers volunteered to annotate the teachers' comments on these essays. Both had at least five years of experience teaching the composition course for English majors. Seven frequent essay feature categories were found: Grammar (N=144), Spelling (N=36), Sentence Diversity (N=120), Conclusion (N=132), Supporting Ideas (N=294), Organization (N=267), and Coherence (N=120). These findings are supported by existing research evidence that ESL teachers pay a great deal of attention to students' written errors [6] (including grammatical and spelling errors), content (including supporting ideas, coherence and sentence diversity) [48], and organization [48]. In fact, current commercial AES systems, such as Criterion [18], include some of these features, such as Grammar, Spelling, Supporting Ideas and Conclusion.

The second task was to ask the two teacher annotators to score each essay feature based on the rubric defined in the Appendix, on a 3-point scale: 1 means negative feedback on an essay feature, 2 neutral, and 3 positive. This analytic rubric is an adapted version of Knoch's [47], focusing on the seven essay features mentioned above. We constructed it after informal interviews with the 10 English teachers about the criteria they used for evaluating their students' written work. As mentioned before, these English teachers helped us collect the dataset, including the student essays and teacher feedback.

The annotators were first trained to use the rubric on 10 essays. A Pearson correlation for each rubric item was computed between the raters' responses. If the correlations between the raters did not exceed r = .50 on all items, the ratings were reexamined until the scores reached the r = .50 threshold. After the raters had reached an inter-rater reliability of at least r = .50, each rater independently evaluated the 327 essays that comprise the corpus used in this study.

The correlations between the raters are reported in Table 1. The raters had the highest correlations for judgments of Grammar (r = .824), Spelling (r = .704), Conclusion (r = .747) and Supporting Ideas (r = .632), and the lowest correlations for Sentence Diversity (r = .454), Coherence (r = .516) and Organization (r = .504). These results are consistent with Crossley and McNamara's findings [49], where deficiencies in the Grammar, Spelling, Conclusion and Supporting Ideas features were easier to identify than those in Coherence and Organization.

In addition, a relatively larger number of negative feedback comments were given on the content-related features: Supporting Ideas (49.1%), Sentence Diversity (33.0%), Conclusion (24.5%), Coherence (23.6%), and Organization (20.8%). These results reveal that the common writing problems of most English-major students were content-related. For example, students had difficulty providing persuasive evidence to support essay arguments and using more complex sentences in their writing.

4 AUTOMATIC FEEDBACK GENERATION

This section describes our approach to automatic essay feedback generation using a supervised machine learning approach based on the annotated dataset described in Section 3. Subsection 4.1 presents the feature model, which was built mainly on Coh-Metrix [40], [51], while Subsection 4.2 describes the evaluation of three common classifiers: Naïve Bayes, Support Vector Machine and Decision Tree. Subsection 4.3 illustrates the automatic feedback generation process.

4.1 Textual Feature Extraction

We used Coh-Metrix 3.0, retrieving 108 textual feature scores. More information can be found on the website (http://cohmetrix.memphis.edu/cohmetrixpr/index.html).

Descriptive indices: These include the number of paragraphs, number of sentences, number of words, number of syllables in words, mean length of paragraphs, etc.

Cohesion: Cohesion is a key aspect of understanding language discourse structure and of how connections within a text influence cohesion and text comprehension [52].

Sentence complexity: The grammatical structure of a text is also an important indicator of human evaluations of text quality. Difficult syntactic constructions (syntactic complexity) include the use of embedded constituents, and are often dense, ambiguous, or ungrammatical [40]. Syntactic complexity is also informed by the density of particular syntactic patterns, word types and phrase types.

Lexical sophistication: Lexical sophistication refers to the writer's use of advanced vocabulary and word choice to convey ideas. It is captured by assessing the type and amount of information provided by the words in a text. Words are assessed in terms of rarity (frequency), abstractness (concreteness), evocation of sensory images (imagability), salience (familiarity), and number of associations (meaningfulness). Words can also vary in the number of senses they contain (polysemy) or the levels they occupy in a conceptual hierarchy (hypernymy).

Moreover, we proposed and extracted 8 new features that are not available in Coh-Metrix. These features capture characteristics of ESL learners' writing style and reflect the importance of the introduction section, the conclusion section, and mechanics errors, including spelling and grammatical errors. In the database, each essay is stored as plain text, where each line is a paragraph. We used a Java API to extract the first-line and last-line text as the introduction and conclusion sections, respectively. For checking spelling errors, an open source spelling checker, LanguageTool (http://www.languagetool.org/), was employed to scan each word. For checking grammatical errors, the Link Grammar Parser [53] was used to check the grammar of each sentence based on natural language processing technology. If the parser could not generate links (relations between pairs of words) after parsing a sentence, the sentence was considered ungrammatical.

Number of words in Introduction: the total number of words in the first paragraph, considered as the introduction section.

Number of words in Conclusion: the total number of words in the last paragraph, considered as the conclusion section.

Introduction Portion: the ratio of the number of words in the introduction to the total number of words in the document.

Conclusion Portion: the ratio of the number of words in the conclusion to the total number of words in the document.
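The four structural features above reduce to simple word counts over the paragraph layout; the paper states only that a Java API was used to take the first and last lines as introduction and conclusion. A minimal sketch of that computation (class and method names are ours, not from the system):

```java
import java.util.Arrays;

// Minimal sketch of the four structural features defined above. The paper
// only says the first and last lines (paragraphs) of the plain-text essay
// are treated as introduction and conclusion; names here are illustrative.
public class StructuralFeatures {

    static int countWords(String paragraph) {
        String trimmed = paragraph.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // One paragraph per line, as the essays are stored in the database.
        String essay = "Intro paragraph ...\nBody paragraph ...\nConclusion paragraph ...";
        String[] paragraphs = essay.split("\n");

        int introWords = countWords(paragraphs[0]);                      // Number of words in Introduction
        int conclWords = countWords(paragraphs[paragraphs.length - 1]);  // Number of words in Conclusion
        int totalWords = Arrays.stream(paragraphs).mapToInt(StructuralFeatures::countWords).sum();

        double introPortion = (double) introWords / totalWords;          // Introduction Portion
        double conclPortion = (double) conclWords / totalWords;          // Conclusion Portion

        System.out.printf("intro=%d, conclusion=%d, introPortion=%.3f, conclusionPortion=%.3f%n",
                introWords, conclWords, introPortion, conclPortion);
    }
}
```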
TABLE 1: RATERS’ INTER-AGREEMENT AND THE DISTRIBUTION OF FEEDBACK VALUE ON EACH CATEGORY
Essay Feature r Negative (%) Neutral (%) Positive (%)
Grammar .824 17.0 63.2 19.8
Spelling .704 19.8 59.4 20.8
Sentence Diversity .454 33.0 57.6 9.4
Conclusion .747 24.5 54.7 20.8
Supporting Ideas .632 49.1 37.7 13.2
Coherence .516 23.6 58.5 17.9
Organization .504 20.8 36.7 42.5
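The r values in Table 1 are plain Pearson correlations over the two raters' 1–3 scores. The paper does not name a statistics package; as one hedged illustration, the check is a one-liner with Apache Commons Math (the score arrays below are invented):

```java
import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;

// Hypothetical fragment: the two annotators' 1-3 scores on one rubric item
// for the same essays (values invented for illustration only).
public class InterRaterAgreement {
    public static void main(String[] args) {
        double[] rater1 = {1, 2, 2, 3, 1, 2, 3, 3, 2, 1};
        double[] rater2 = {1, 2, 3, 3, 1, 2, 3, 2, 2, 1};
        double r = new PearsonsCorrelation().correlation(rater1, rater2);
        System.out.printf("Pearson r = %.3f%n", r);  // compared against the .50 threshold
    }
}
```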

Spelling errors: the number of spelling errors. We employed an open source spelling checker called LanguageTool (http://www.languagetool.org/), which was part of the OpenOffice suite.

Grammatical errors: the number of sentences with grammatical errors. We used the Link Grammar Parser [53], which is also widely used in ESL contexts, to check the grammar of each sentence.

Percentage of spelling errors: the ratio of the number of misspelled words to the total number of words in the document.

Percentage of grammatical errors: the ratio of the number of sentences with grammatical errors to the total number of sentences in the document.

In total, therefore, 116 features are extracted from each essay.
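LanguageTool is itself a Java library, so the spelling-error features can be computed directly with its public API; a minimal sketch follows (the sample text is invented, and the grammatical-error features would follow the same counting pattern with the Link Grammar Parser's "no complete linkage" test in place of the spelling-rule filter):

```java
import org.languagetool.JLanguageTool;
import org.languagetool.language.AmericanEnglish;

// Sketch of the spelling-error features using LanguageTool's Java API.
// The sample essay text is invented; counts depend on the loaded language.
public class ErrorFeatures {
    public static void main(String[] args) throws Exception {
        String essay = "Childen should get paid by doing housework ...";
        JLanguageTool tool = new JLanguageTool(new AmericanEnglish());

        // Keep only matches raised by dictionary-based spelling rules.
        long spellingErrors = tool.check(essay).stream()
                .filter(m -> m.getRule().isDictionaryBasedSpellingRule())
                .count();

        int totalWords = essay.trim().split("\\s+").length;
        double spellingErrorPct = (double) spellingErrors / totalWords;  // Percentage of spelling errors
        System.out.printf("spelling errors = %d (%.3f of all words)%n", spellingErrors, spellingErrorPct);

        // The grammatical-error features would be counted the same way, with
        // the Link Grammar Parser's per-sentence "no complete linkage" test
        // standing in for the spelling-rule filter above.
    }
}
```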
4.2 Automatic Negative Feedback Classification Result

Since pointing out the weaknesses of an essay is important in written feedback, our approach emphasizes automatic negative feedback classification for each of the seven essay features (Grammar, Spelling, Sentence Diversity, Conclusion, Supporting Ideas, Coherence and Organization). As mentioned in Section 3, teachers annotated each feature of a student essay as positive, neutral or negative feedback. In order to improve the classification performance, we relabeled each instance (one feature of a student essay) as either negative or non-negative feedback. In other words, instances previously labeled as positive or neutral feedback were relabeled as non-negative feedback. Precision, recall and F-score were used as the evaluation metrics. Precision is the fraction of system-classified instances that are correctly classified as negative feedback, while recall is the fraction of the negative feedback instances that are detected by the system. The F-score is calculated as follows:

F1 = (2 × precision × recall) / (precision + recall)

Ten-fold cross-validation was used to evaluate the classification performance of the three classifiers. In addition, we defined a baseline for the performance evaluation: the baseline classifier labels every instance as negative feedback, so its precision is the proportion of negative instances in the dataset, while its recall is 1.

Table 2 shows that J48 outperforms the other classifiers across the seven essay features. The Grammar and Spelling features were easier to classify (F-score above 0.820 across the three classifiers) since they do not require deep text understanding. Among the remaining features, Sentence Diversity and Supporting Ideas were easier to detect using J48 (F-scores of 0.524 and 0.490, respectively). The Conclusion, Organization and Coherence categories were generally difficult to classify: the F-score was 0.485 for Conclusion, 0.439 for Coherence and 0.405 for Organization. Some of these findings are consistent with Crossley and McNamara's study [54]: the cohesion features extracted by Coh-Metrix were not sufficient predictors of the coherence quality rated by the human teachers.
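The paper does not name its machine-learning toolkit, but J48, SMO and Naïve Bayes are the classifier names used in Weka, so the per-feature evaluation behind Table 2 can plausibly be reproduced with the Weka Java API along these lines (the ARFF file name and class-attribute layout are our assumptions):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: ten-fold cross-validation of a J48 negative-feedback classifier for
// one essay feature. "grammar.arff" (the 116 textual features plus a binary
// negative/non-negative class attribute in the last position) is an assumed
// file layout, not one described in the paper.
public class NegativeFeedbackClassifier {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("grammar.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class = negative / non-negative

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        int negative = 0;  // index of the "negative" class value, assumed to come first
        System.out.printf("P=%.3f R=%.3f F=%.3f%n",
                eval.precision(negative), eval.recall(negative), eval.fMeasure(negative));
    }
}
```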

TABLE 2: THE CLASSIFICATION RESULT OF THE FOUR CLASSIFIERS IN NEGATIVE FEEDBACK.

                      J48                  Naïve Bayes          SMO                  Baseline
Essay Feature         P     R     F        P     R     F        P     R     F        P     R     F
Grammar               0.850 0.804 0.820    0.650 0.604 0.620    0.704 0.726 0.711    0.170 1     0.291
Spelling              0.984 0.984 0.984    0.898 0.841 0.869    0.841 0.921 0.879    0.594 1     0.745
Sentence Diversity    0.532 0.547 0.524    0.360 0.257 0.300    0.483 0.491 0.486    0.330 1     0.496
Conclusion            0.490 0.491 0.485    0.243 0.346 0.286    0.424 0.425 0.421    0.245 1     0.394
Supporting Ideas      0.522 0.462 0.490    0.474 0.519 0.495    0.458 0.423 0.440    0.491 1     0.659
Coherence             0.423 0.462 0.439    0.136 0.158 0.146    0.381 0.358 0.369    0.179 1     0.304
Organization          0.404 0.406 0.405    0.130 0.136 0.133    0.471 0.364 0.410    0.208 1     0.344

Note: P refers to precision, R to recall, and F to F-score.

4.3 Automatic Feedback Generation System

Similar to Glosser [22], SAM is a web-based automatic feedback generation system comprising indirect corrective feedback generation, assignment management and online text editing. The system contains seven negative feedback classifiers, one per essay feature, implemented with the J48 algorithm described in Section 4.2, and a text editor with revision control.

The automatic feedback generation process is as follows. A student first writes an essay in the online text editor. After the student finishes writing, each feedback classifier classifies the essay with regard to its particular essay feature. If the classification result is negative, the system generates indirect corrective feedback questions. The two English teachers who annotated the teachers' comments predefined the feedback questions for each essay feature based on their teaching experience (see Table 3). This feedback suggests that students "double-check" their essay with the several trigger questions provided, and it tends to help students reflect on what they have written. Currently, the system presents all the automatically classified negative feedback to the students (see Figure 2).
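Concretely, the generation step is a lookup: every feature whose classifier returns "negative" contributes its predefined Table 3 questions to the feedback shown to the student. A schematic sketch of that dispatch (the EssayClassifier interface is a stand-in for the trained J48 models, not code taken from SAM):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Schematic of the dispatch step: run the per-feature negative-feedback
// classifiers and collect the teacher-predefined trigger questions (Table 3)
// of every feature classified as negative.
public class FeedbackGenerator {
    interface EssayClassifier { boolean isNegative(String essayText); }

    private final Map<String, EssayClassifier> classifiers = new LinkedHashMap<>();
    private final Map<String, List<String>> triggerQuestions = new LinkedHashMap<>();

    List<String> generate(String essayText) {
        return classifiers.entrySet().stream()
                .filter(e -> e.getValue().isNegative(essayText))          // negative features only
                .flatMap(e -> triggerQuestions.get(e.getKey()).stream())  // their Table 3 questions
                .collect(Collectors.toList());
    }
}
```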
We used Etherpad (www.etherpad.org) to implement the online text editor embedded in the system. Etherpad is a highly customizable open source online editor. Like Google Docs, the text editor automatically stores the revision history of an edited document, which allows us to easily access the number of revisions and the number of words edited in each revision. In addition, we installed the reference plugin in the text editor, which gives students easy access to the concept behind a particular essay feature, such as organization or coherence, so that they can understand the concept well.

The system also allows teachers to give direct feedback on a specific text segment of the essay in the Etherpad text editor using the comment plugin. As with the comment function in Microsoft Word, the student can accept changes based on the comments.

Figure 2 shows an example of the system-generated feedback about a coherence issue. The following predefined feedback questions are used for self-correction:
1) Does your topic sentence connect to the previous paragraph?
2) Do you understand how each paragraph and sentence follows from the previous one?
3) The communication of your ideas needs stronger cohesive cues and devices, such as transitions and connective phrases that link ideas. Do you use them correctly?
The student checks her essay based on the feedback provided and tries to improve the coherence aspects of the essay.
TABLE 3: EXAMPLES OF ICF ON SEVEN ASPECTS OF THE WRITING

Essay Feature       Examples
Grammar             Do you use the grammar check function in Microsoft Word? Are you sure each sentence has a subject and a verb? Do subjects and verbs agree with one another in number in your sentences?
Spelling            Do you adopt the spelling that your dictionary gives first in any entry with variant spellings? Do you use the spell check function in Microsoft Word? Make sure you double-check for capitalization errors and misspelled words in the essay.
Sentence Diversity  Do you use a variety of short and long sentences? Try to use condensed and split sentences in turn. Avoid excessively long sentences (25 words or more). If you find yourself pausing for breath when reading a sentence aloud, it may already be too long.
Conclusion          Does your conclusion bring all of your ideas together and make a concise summary of the main points? Does your conclusion restate the thesis in different words or phrases? Don't introduce new concepts that were not addressed previously.
Supporting Ideas    Do your body paragraphs rely heavily on unsupported hypothetical and opinion-based claims rather than fact-based claims? Do not make too many overgeneralizations. You can always add more elaboration, factual information, persuasive evidence, and specific examples to get your point across.
Coherence           Does your topic sentence connect to the previous paragraph? Do you understand how each paragraph and sentence follows from the previous one? The communication of your ideas needs stronger cohesive cues and devices, such as transitions and connective phrases that link ideas. Do you use them correctly?
Organization        Planning before an essay will get you a clearer structure and a higher grade. Make sure your introduction sets out a robust position of your own and previews some of the ideas to be developed in the body. Is your paragraph organization linear and explicit?

Figure 2: Screenshot of indirect corrective feedback generated by the system



Figure 3: Screenshot of direct corrective feedback given by a teacher

If the student does not understand the importance of coherence well, he or she can get more information about coherence by clicking the Coherence link (see Figure 2). This approach is similar to Glosser [22], where the feedback or trigger questions do not contain specific answers. They are intended for students to perform self-reflection with the trigger questions provided.

Moreover, a human teacher can give corrective comments on the essay through the system. Figure 3 shows an example of a teacher's direct corrective comments on the same student essay. In the example, regarding the essay's coherence, several connectives were misused. In this case, the teacher highlights the 'so' and comments on it for correction, as shown on the right-hand side of the screen.

5 USER STUDY

The aim of this study is to investigate the impact of the system-generated indirect corrective feedback and the human teachers' direct corrective feedback on revision. In particular, we analyzed the improvement of essay quality with regard to the seven essay features, the students' perception of the feedback, and the computer-recorded revision history in the two feedback classes.

5.1 Participants and Procedure

As in the first study, participants were second-year English-major students at Southwest University in China, enrolled in two comprehensive English classes in 2016. Two English teachers volunteered to participate in this study; both had also taken part in the teacher comment annotation experiment. In the traditional DCF class (n=54, 50 females, 4 males), each student received only DCF made by these teachers regarding the seven features. In the ICF class (n=56, 48 females, 8 males), students received only ICF automatically generated by the system. In addition, we examined the participants' prior knowledge about writing by analyzing their scores in the first-year writing class. An independent-samples t-test with a 95% confidence interval was conducted to compare the scores obtained in the DCF and ICF classes. There was no significant difference between the scores of the DCF (M=71.91, SD=11.05) and ICF (M=73.50, SD=11.60) classes (t(108)=.737, p=.463).

The writing task included two main stages. In the first stage, students were asked to write a persuasive essay in 40 minutes during class. The essay question was the same as in the first study, described in Section 3. Then, the two teachers gave comments for the DCF class (each teacher gave feedback on 27 essays), while the system automatically generated ICF for the ICF class. Students had one week to revise their drafts based on the feedback. The next week, students submitted their revised versions in the system, and the two teachers scored each essay feature of the 110 essays, based on their previous annotation experience described in Section 3. Finally, all the student participants responded to a survey and some of them took part in an informal interview.

5.2 Results and Discussion

5.2.1 Feedback Generation

The two types of feedback, generated by the system and by the human teachers, covered both form (e.g. grammar) and content. As we expected, the teachers generated more feedback than the system did, since teachers could give specific feedback at multiple specific places in a student essay for one type of problem, such as Grammar (see Table 4). In addition, similar to the previous findings described in Section 3, we observed that more of the feedback related to the content-oriented issues in the Sentence Diversity, Organization, Supporting Ideas and Coherence features.

TABLE 4: THE NUMBER OF FEEDBACK ITEMS GENERATED IN THE TWO CLASSES

Class  Spelling  Grammar  Coherence  Supporting Ideas  Sentence Diversity  Conclusion  Organization
DCF    11        15       21         16                19                  42          23
ICF    6         10       18         13                16                  38          14
TABLE 5: THE IMPACT OF THE FEEDBACK ON DIFFERENT ESSAY FEATURES IN THE ICF AND DCF CLASSES. THE MEAN DIFFERENCE IS THE SECOND-DRAFT SCORE MINUS THE FIRST-DRAFT SCORE.

Essay Feature       Class  Mean Difference  Standard Deviation  t      N   Sig. (Two-Tailed)
Spelling            ICF    0.11             0.066               1.627  55  .110
                    DCF    0.44             0.094               4.724  53  <.001
Grammar             ICF    0.13             0.075               1.729  55  .089
                    DCF    0.29             0.061               4.690  53  <.001
Coherence           ICF    0.54             0.105               5.104  55  <.001
                    DCF    0.63             0.085               7.423  53  <.001
Conclusion          ICF    0.57             0.076               7.535  55  <.001
                    DCF    0.30             0.098               3.036  53  .004
Supporting Ideas    ICF    0.46             0.071               5.201  55  <.001
                    DCF    0.54             0.094               5.698  53  <.001
Sentence Diversity  ICF    0.07             0.088               0.814  55  .419
                    DCF    0.06             0.077               0.724  53  .472
Organization        ICF    0.39             0.110               3.567  55  <.001
                    DCF    0.87             0.095               9.112  53  <.001
5.2.2 The Improvement of Essay Quality in Seven Aspects

Students first wrote a first draft, then revised it based on the feedback received, and finally produced a second draft. The human teachers scored each aspect of the second draft. Table 5 shows that the quality of the first draft in terms of the seven features improved in both classes. Paired t-tests with a 95% confidence interval revealed that the improvements in spelling and grammar were significant in the DCF class: t(53)=4.724, p<0.001 for spelling, and t(53)=4.690, p<0.001 for grammar. However, no significant improvement on these two aspects was observed in the ICF class: t(55)=1.627, p>0.05 for spelling, and t(55)=1.729, p>0.05 for grammar. Moreover, statistical tests were conducted to examine the differences between the ICF and DCF classes. Independent t-tests showed significant differences between the DCF and ICF classes: t(106)=18.300, p<0.001 for grammar, and t(106)=12.236, p<0.001 for spelling. These results indicate that students improved their linguistic accuracy more when the DCF strategy was applied rather than ICF. This finding is in line with Ferris [14], who reported that direct error correction led to better grammatical accuracy than indirect error feedback.
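The paper does not say which statistics package produced these tests; as a hedged illustration, the paired comparison between drafts (and the independent-samples comparison between classes) is reproducible with Apache Commons Math:

```java
import org.apache.commons.math3.stat.inference.TTest;

// Sketch: the paired t-test behind Table 5 for one essay feature. The score
// arrays are placeholders; the real input would be the 1-3 teacher ratings
// of the first and second drafts for each student.
public class RevisionStats {
    public static void main(String[] args) {
        double[] firstDraft  = {2, 1, 2, 3, 2, 1, 2, 2};
        double[] secondDraft = {2, 2, 3, 3, 2, 2, 2, 3};

        TTest t = new TTest();
        double tStat = t.pairedT(secondDraft, firstDraft);           // paired t statistic
        double pTwoTailed = t.pairedTTest(secondDraft, firstDraft);  // two-tailed p-value
        System.out.printf("t = %.3f, p = %.4f%n", tStat, pTwoTailed);

        // The between-class comparison is the independent-samples version:
        // t.tTest(icfScores, dcfScores) likewise returns a two-tailed p-value.
    }
}
```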
Regarding the content-related errors, we observed that student writers tend to use assertions (e.g. we must / we should) rather than factual evidence to support the essay argument. In addition, illogical organization is another big issue. The main reasons for these problems are that Chinese ESL writers favor deductive thinking [55] and hold a reader-responsibility view of writing (it is the reader's responsibility to connect the ideas, understand the context and draw conclusions) [56]. The statistical t-test results show that the second draft significantly outperformed the first draft in terms of organization, structure, coherence, supporting ideas and conclusion in both classes. ICF points out the way to improve content-related aspects of the writing. In fact, the majority of the students in the ICF class agreed that the feedback on these aspects was helpful. Instead of revising only to remove errors, writers try to reconsider and refine the whole text. The findings are consistent with those of van Beuningen and Kuiken's study [7], which showed that both DCF and ICF were effective means to improve the overall quality of student writing from an initial writing task to its revision. This result also highlights the importance of giving feedback on both content (structure, organization, supporting ideas, conclusion) and form (grammar and spelling), which results in better performance [4], [8].

Regarding sentence diversity, no significant improvement was found in either class: t(55)=0.814, p>0.05 in the ICF class, and t(53)=0.724, p>0.05 in the DCF class. Most of the students acknowledged that it was difficult to use the feedback to construct complex sentences due to their linguistic limitations, such as a lack of knowledge about vocabulary and complex grammar, even with teacher feedback such as the following:

Don't use too many short sentences. Try to use more complex sentences like compound sentences and subordinate clauses.

Although this feedback addresses a specific piece of content, the student was still not sure how to revise it due to her linguistic limitations.

In sum, these results indicate that the two types of feedback were useful for improving the quality of the writing in terms of essay organization, coherence, supporting ideas and conclusion. Students who received DCF showed significantly better improvement than those who received ICF in relation to the Grammar and Spelling features.

5.2.3 Student Perception of the Feedback

After the study, students in both classes were asked to indicate their level of agreement with the statement that the feedback was helpful, for each of the seven essay features. A five-point Likert scale was used to rate the level of agreement. In addition to the questionnaire, informal user feedback was collected from students in both classes.

In the DCF class, more than 73% of participants agreed or strongly agreed that the feedback was helpful for Grammar, Spelling, Conclusion, Supporting Ideas, Coherence and Organization; 62% of participants agreed or strongly agreed that the feedback was helpful for Sentence Diversity. Most of the students interviewed in the DCF class said that the DCF was helpful in the sense that it pointed out their errors in writing, so that they knew what errors they had made. However, when asked whether the feedback was useful for improving their writing skills, doubts were expressed: Sometimes, I don't understand teacher comments, so I just quickly accept the changes suggested by the teacher. I might make the same mistakes in the next composition.

In the ICF class, more than 65% of participants agreed or strongly agreed that the feedback was helpful for Conclusion, Supporting Ideas, Coherence and Organization. Forty-eight percent of participants agreed or strongly agreed that the feedback was helpful for Grammar, Spelling and Sentence Diversity. Some students interviewed said that they liked being able to explore solutions by themselves. However, other students hoped that the feedback could be more specific. Moreover, students complained that the feedback items were too numerous when four to seven aspects of their essays had to be double-checked.

Overall, most students in the DCF class agreed on the usefulness of the feedback for all seven aspects of the essay, while students in the ICF class found the feedback particularly useful for Supporting Ideas, Organization, Coherence and Conclusion. Less positive feedback on Grammar, Spelling and Sentence Diversity was received in the ICF class, since these suggestions were too general. This result is in line with Miceli's study [15], where students felt that ICF was useful in encouraging them to reflect on aspects of their writing and to develop improvements in the content, while DCF was felt to be more helpful when revising syntax and vocabulary.

5.2.4 Observations of the Computer-Generated Revision History

The system keeps a revision history of each document. Based on this information, we can analyze how many text changes were made after students received the feedback, in terms of the number of revisions and text edits. A text edit refers to a create, modify or remove action on the text. Table 6 shows that students receiving ICF made more text changes than students receiving DCF, both in the number of edited words (52 in the ICF class vs. 36 in the DCF class) and in the number of revisions (108 in the ICF class vs. 79 in the DCF class). Independent-samples t-test results revealed that students in the ICF class made significantly more changes than students in the DCF class: t(108)=5.798, p<0.001 for the number of edited words, and t(108)=10.600, p<0.001 for the number of revisions. These results indicate that students in the ICF class spent more time revising the essay than students in the DCF class did. The main reason is that ICF encourages students to be more active in their use of feedback [12].
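The paper does not describe how these counts were extracted from Etherpad; one plausible route, sketched below, is Etherpad's HTTP API, whose getRevisionsCount method returns exactly the per-pad revision count (host, API key and pad ID are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: querying Etherpad's HTTP API for the revision count of one pad.
// getRevisionsCount is part of the Etherpad API and answers with JSON
// like {"code":0,"message":"ok","data":{"revisions":108}}.
public class RevisionCounter {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:9001/api/1/getRevisionsCount"
                + "?apikey=YOUR_API_KEY&padID=student42-essay";
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());  // parse "revisions" out of the JSON payload
    }
}
```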
6 CONCLUSION AND FUTURE WORK

The most significant findings of this study are that the system-generated ICF was useful for improving the quality of writing, particularly in the structure, organization, supporting ideas, coherence, and conclusion aspects of the essay in the short term. In addition, the ICF encouraged students to spend more time on self-correction than the DCF provided by teachers. These findings are consistent with previous study results, which reported positive impacts of ICF on the ability of students to edit their own compositions and to improve levels of accuracy in writing [10], and with the reported effectiveness of combined form- and content-focused feedback in improving writing development [4], [8]. Although findings on the effectiveness of different feedback types have been conflicting, largely due to the widely varying student populations, types of writing and feedback practices examined [9], this study's results imply that the system could be useful for Chinese English-major students with advanced metalinguistic knowledge, particularly for content development. We also observed that some incorrect ICF was generated. However, incorrect feedback might still be beneficial if it promotes noticing: the system requires students to consider the suggested weaknesses of the essay and evaluate them in context to determine whether they are correct.

This study has some limitations. The system-generated feedback could be too general for correcting errors in some aspects of the essay, such as Grammar and Spelling. Moreover, some essays triggered feedback on many features, which generated a daunting number of suggestions. This quantity of feedback seemed to undermine our scaffolding goal by targeting too many essay elements at once. Finally, this study did not examine the impact of the tool on non-English-major students' writing. Non-English-major students do not have as much metalinguistic knowledge, so the impact of the feedback might be different.

In future work, we will focus on how to combine DCF and ICF to support different aspects of students' writing.
TABLE 6: TEXT CHANGES MADE IN THE REVISION

Class  N   Average Number of Edited Words  Std. Dev. of Edited Words  Average Number of Revisions  Std. Dev. of Revisions
ICF    56  52                              20.19                      108                          21.69
DCF    54  36                              12.56                      79                           14.54

students’ writing. Particularly, we will examine different Second Language Writing: Contexts and Issues , K.
English parsers, such as Stanford Parser, which it might Hyland and F. Hyland, Eds. New York, 2006, pp. 1–
give more specific feedback on spelling and grammar. We 104.
also will study how much amount of feedback is suffi- [10] D. Ferris and B. Roberts, “Error feedback in L2
cient. Zacharias [57] found that if students had too much writing classes How explicit does it need to be?,” J.
feedback they would feel discouraged and were less like- Second Lang. Writ., vol. 10, no. 3, pp. 161–184,
ly to be motivated to use it for revision. This could be lim- 2001.
ited based on the severity and/or number of negative [11] J. Chandler, “The efficacy of various kinds of error
features detected in the text. Lastly, the future work will feedback for improvement in the accuracy and
investigate how to effectively use this tool for teaching
fluency of L2 student writing,” J. Second Lang.
and encouraging independent editing skills. Ferris [58]
Writ., vol. 12, no. 3, pp. 267–296, 2003.
and Mu [59] have provided descriptions of procedures for
[12] F. Hyland, “Providing effective support  :
helping students learn to self-edit, and computer-assisted
investigating feedback to distance language learners,”
feedback could be included in such procedures.
Openg Learn., vol. 16, no. 3, pp. 233–237, 2001.
ACKNOWLEDGEMENT

This work is supported by the Chongqing Social Science Planning Fund Program under grant No. 2014BS123; the Fundamental Research Funds for the Central Universities under grants No. SWU114005, No. XDJK2014A002, No. XDJK2014C141, and No. CQU903005203326; the National Natural Science Foundation of China under grant No. 61502397; and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
REFERENCES

[1] National Bureau of Statistics of China, China Statistical Yearbook. China Statistics Press, 2013.
[2] L. Brannon and C. H. Knoblauch, “On students’ rights to their own texts: A model of teacher response,” Coll. Compos. Commun., vol. 33, no. 2, pp. 157–166, 1982.
[3] I. Leki, “The preferences of ESL students for error correction in college-level writing classes,” Foreign Lang. Ann., vol. 24, pp. 203–218, 1991.
[4] A. K. Fathman and E. Whalley, “Teacher response to student writing: Focus on form versus content,” in Second Language Writing: Research Insights for the Classroom, 1990, pp. 178–190.
[5] D. Ferris, Response to Student Writing: Implications for Second Language Students. Mahwah, NJ: Lawrence Erlbaum Associates, 2003.
[6] I. Lee, “Error correction in L2 secondary writing classrooms: The case of Hong Kong,” J. Second Lang. Writ., vol. 13, pp. 285–312, 2004.
[7] C. G. van Beuningen, N. H. de Jong, and F. Kuiken, “The effect of direct and indirect corrective feedback on L2 learners’ written accuracy,” ITL Int. J. Appl. Linguist., vol. 156, pp. 279–296, 2008.
[8] T. Ashwell, “Patterns of teacher response to student writing in a multiple-draft composition classroom: Is content feedback followed by form feedback the best method?,” J. Second Lang. Writ., vol. 9, no. 3, pp. 227–257, 2000.
[9] D. R. Ferris, “Does error feedback help student writers? New evidence on the short- and long-term effects of written correction,” in Feedback in Second Language Writing: Contexts and Issues, K. Hyland and F. Hyland, Eds. New York, 2006, pp. 1–104.
[10] D. Ferris and B. Roberts, “Error feedback in L2 writing classes: How explicit does it need to be?,” J. Second Lang. Writ., vol. 10, no. 3, pp. 161–184, 2001.
[11] J. Chandler, “The efficacy of various kinds of error feedback for improvement in the accuracy and fluency of L2 student writing,” J. Second Lang. Writ., vol. 12, no. 3, pp. 267–296, 2003.
[12] F. Hyland, “Providing effective support: Investigating feedback to distance language learners,” Open Learn., vol. 16, no. 3, pp. 233–237, 2001.
[13] D. Little, “Learner autonomy and second/foreign language learning,” Subject Centre for Languages, Linguistics and Area Studies, 2003. [Online]. Available: http://www.llas.ac.uk/resources/gpg/1409
[14] D. Ferris, Treatment of Error in Second Language Student Writing. Ann Arbor, MI: The University of Michigan Press, 2002.
[15] T. Miceli, “Foreign language students’ perceptions of a reflective approach to text correction,” Flinders Univ. Lang. Gr. Online Rev., vol. 3, no. 1, pp. 25–36, 2006.
[16] Y. Attali and J. Burstein, “Automated essay scoring with e-rater V.2,” J. Technol. Learn. Assess., vol. 4, no. 3, 2006.
[17] S. Elliot, “IntelliMetric: From here to validity,” in Automated Essay Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ: Lawrence Erlbaum Associates, 2003, pp. 71–86.
[18] J. Burstein, M. Chodorow, and C. Leacock, “Automated essay evaluation: The Criterion online writing service,” AI Mag., vol. 25, p. 27, 2004.
[19] R. Haswell, “The complexities of responding to student writing; or, looking for shortcuts via the road of excess,” Across Discip., vol. 3, 2006.
[20] S. C. Weigle, “English as a second language writing,” in Handbook of Automated Essay Evaluation, 2013, pp. 36–53.
[21] K. Yancey, A. Lunsford, J. McDonald, C. Moran, M. Neal, C. Pryor, D. Roen, and C. Selfe, “CCCC position statement on teaching, learning, and assessing writing in digital environments,” Coll. Compos. Commun., vol. 55, no. 4, pp. 785–790, 2004.
[22] J. Villalon, P. Kearney, R. A. Calvo, and P. Reimann, “Glosser: Enhanced feedback for student writing tasks,” 2008, pp. 454–458.
[23] M. Liu, R. A. Calvo, and V. Rus, “Automatic question generation for literature review writing support,” 2010.
[24] D. S. McNamara, S. A. Crossley, and R. Roscoe, “Natural language processing in an intelligent writing strategy tutoring system,” Behav. Res. Methods, vol. 45, no. 2, pp. 499–515, Jun. 2013.
[25] M. Liu, R. Calvo, and V. Rus, “Automatic generation and ranking of questions for critical review,” Educ. Technol. Soc., vol. 17, no. 2, pp. 333–346, 2014.
[26] T. K. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch, Handbook of Latent Semantic Analysis. Lawrence Erlbaum, 2007.
[27] L. M. Rudner, “Reducing errors due to the use of judges,” Pract. Assess. Res. Eval., vol. 3, no. 3, 1992.
[28] S. Dikli, “Automated essay scoring,” Turkish Online J. Distance Educ., vol. 7, no. 1, pp. 49–62, 2006.
[29] M. A. Hearst, “The debate on automated essay grading,” IEEE Intell. Syst. Their Appl., vol. 15, no. 5, pp. 22–37, 2000.
[30] D. Wade-Stein and E. Kintsch, “Summary Street: Interactive computer support for writing,” Cogn. Instr., vol. 22, no. 3, pp. 333–362, 2004.
[31] M. Shermis and J. Burstein, Automated Essay Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ: Lawrence Erlbaum Associates, 2003.
[32] P. F. Ericsson and R. Haswell, Machine Scoring of Human Essays: Truth and Consequences. Utah State University Press, 2006.
[33] G. Gibbs and C. Simpson, “Conditions under which assessment supports students’ learning,” Learn. Teach. High. Educ., vol. 1, no. 1, pp. 3–31, 2004.
[34] R. A. Calvo and R. A. Ellis, “Students’ conceptions of tutor and automated feedback in professional writing,” J. Eng. Educ., vol. 99, pp. 427–438, 2010.
[35] E. C. Thiesmeyer and J. E. Thiesmeyer, Editor: A System for Checking Usage, Mechanics, Vocabulary, and Structure. New York, NY: Modern Language Association, 1990.
[36] J. Anderson, Mechanically Inclined: Building Grammar, Usage, and Style into Writer’s Workshop. Stenhouse Publishers, 2005.
[37] T. J. Beals, “Between teachers and computers: Does text-checking software really improve student writing?,” English J., pp. 67–72, 1998.
[38] M. A. Britt, P. Wiemer-Hastings, A. A. Larson, and C. A. Perfetti, “Using intelligent feedback to improve sourcing and integration in students’ essays,” Int. J. Artif. Intell. Educ., vol. 14, no. 3–4, pp. 359–374, 2004.
[39] T. Kakkonen and E. Sutinen, “EssayAid: Towards a semi-automatic system for assessing student texts,” Int. J. Contin. Eng. Educ. Life-Long Learn., vol. 21, no. 2/3, pp. 119–139, 2011.
[40] A. C. Graesser, D. S. McNamara, M. M. Louwerse, and Z. Cai, “Coh-Metrix: Analysis of text on cohesion and language,” Behav. Res. Methods Instrum. Comput., vol. 36, no. 2, pp. 193–202, May 2004.
[41] J. W. Pennebaker and M. E. Francis, Linguistic Inquiry and Word Count (LIWC). Mahwah, NJ: Erlbaum, 1999.
[42] R. M. Rufenacht, P. M. McCarthy, and T. A. Lamkin, “Fairy tales and ESL texts: An analysis of linguistic features using the Gramulator,” in Proc. 24th Int. Florida Artificial Intelligence Research Society Conf., 2011, pp. 287–292.
[43] S. Crossley and D. McNamara, “Predicting second language writing proficiency: The role of cohesion, readability, and lexical difficulty,” J. Res. Read., vol. 35, pp. 115–135, 2012.
[44] S. A. Crossley and D. S. McNamara, “Text coherence and judgments of essay quality: Models of quality and coherence,” in Proc. 33rd Annu. Conf. Cognitive Science Society, 2011.
[45] C. Alderson, Diagnosing Foreign Language Proficiency: The Interface between Learning and Assessment. London, UK: Continuum, 2005.
[46] B. North, “Scales for rating language performance: Descriptive models, formulation styles, and presentation formats,” TOEFL Monogr., vol. 24, 2003.
[47] U. Knoch, “Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from?,” Assess. Writ., vol. 16, no. 2, pp. 81–96, 2011.
[48] I. Lee, “Feedback in Hong Kong secondary writing classrooms: Assessment for learning or assessment of learning?,” Assess. Writ., vol. 12, no. 3, pp. 180–198, 2007.
[49] S. Crossley and D. McNamara, “Cohesion, coherence, and expert evaluations of writing proficiency,” in Proc. 32nd Annu. Conf. Cognitive Science Society, 2010, pp. 984–989.
[50] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software,” ACM SIGKDD Explor., vol. 11, pp. 10–18, 2009.
[51] D. S. McNamara, A. C. Graesser, P. M. McCarthy, and Z. Cai, Automated Evaluation of Text and Discourse with Coh-Metrix, 2012.
[52] W. Kintsch and T. van Dijk, “Towards a model of text comprehension and production,” Psychol. Rev., vol. 85, pp. 363–394, 1978.
[53] J. Lafferty, D. Sleator, and D. Temperley, “Grammatical trigrams: A probabilistic model of link grammar,” in Proc. AAAI Conf. Probabilistic Approaches to Natural Language, 1992.
[54] S. A. Crossley and D. S. McNamara, “Understanding expert ratings of essay quality: Coh-Metrix analyses of first and second language writing,” Int. J. Contin. Eng. Educ. Life-Long Learn., vol. 21, no. 2/3, p. 170, 2011.
[55] L. Jinghong, “The impact of the English writing theories on Chinese English second language writing,” Foreign Lang. Teach., no. 2, pp. 41–47, 2006.
[56] J. Hinds, “Quasi-inductive: Expository writing in Japanese, Korean, Chinese and Thai,” in Coherence in Writing: Research and Pedagogical Perspectives, 1990.
[57] N. T. Zacharias, “Teacher and student attitudes toward teacher feedback,” RELC J., vol. 38, no. 1, pp. 38–52, 2007.
[58] D. R. Ferris, H. Liu, A. Sinha, and M. Senna, “Written corrective feedback for individual L2 writers,” J. Second Lang. Writ., vol. 22, no. 3, pp. 307–329, 2013.
[59] C. Mu, “A taxonomy of ESL writing strategies,” in Proc. Redesigning Pedagogy: Research, Policy, Practice, 2005, pp. 1–10.
Dr. Ming Liu is an associate professor at the School of Computer and Information Science, Southwest University, China. He received his PhD in Artificial Intelligence in Education from the School of Electrical and Information Engineering, The University of Sydney, Australia, in 2012. His main research interests include question generation, learning analytics, and intelligent tutoring systems. He has participated in national and international projects funded by ARC Linkage (Australia), the Young and Well CRC, the Office of Teaching and Learning, Google, and the Chinese National Fund. He is the author of over 20 papers in prestigious conferences and journals, such as Intelligent Tutoring Systems, IEEE Transactions on Learning Technologies, and the Journal of Educational Technology and Society.
Dr. Yi Li is an associate professor at the Faculty of Education, Southwest University, China. She holds a master's degree in applied statistics and a PhD in educational measurement from Purdue University in the USA. Her research interest is the application of quantitative methodology to explore, understand, and influence technology innovation in the field of education. She has participated in national and international projects funded by the National Science Foundation and the National Social Science Foundation of China. Yi is the author of many research publications in prestigious international conferences and journals.
Dr. Weiwei Xu is an associate professor at the College of International Studies, Southwest University, China. She holds a doctorate in English from Macquarie University, Australia, and an MA in English from Newcastle University, UK. She has been teaching English academic writing for eight years.

Dr. Li Liu is an associate professor at Chongqing University. He is also serving as a Senior Research Fellow at the School of Computing, National University of Singapore. Li received his PhD in Computer Science from Université Paris-Sud XI in 2008. He previously served as an associate professor at Lanzhou University in China. His research interests are in pattern recognition, data analysis, and their applications to human behavior. He aims to contribute to interdisciplinary research between computer science and human-related disciplines. Li has published widely in conferences and journals, with more than 50 peer-reviewed publications, and has been the Principal Investigator of several projects funded by government and industry.