Automated Essay Evaluation With Semantic Analysis

Knowledge-Based Systems 120 (2017) 118–132


Automated essay evaluation with semantic analysis


Kaja Zupanc, Zoran Bosnić
University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia

Article info

Article history:
Received 24 March 2016
Revised 12 October 2016
Accepted 1 January 2017
Available online 3 January 2017

Keywords:
Automated scoring
Essay evaluation
Natural language processing
Semantic attributes
Semantic feedback

Abstract

Essays are considered the most useful tool to assess learning outcomes, guide students' learning process, and measure their progress. Manual grading of students' essays is a time-consuming process, but is nevertheless necessary. Automated essay evaluation represents a practical solution to this task; however, its main weakness is the predominant focus on vocabulary and text syntax, and the limited consideration of text semantics. In this work, we propose an extension of existing automated essay evaluation systems that incorporates additional semantic coherence and consistency attributes. We design the novel coherence attributes by transforming sequential parts of an essay into the semantic space and measuring changes between them to estimate the coherence of the text. The novel consistency attributes detect semantic errors using information extraction and logic reasoning. The resulting system (named SAGE - Semantic Automated Grader for Essays) provides semantic feedback for the writer and achieves significantly higher grading accuracy compared with 9 other state-of-the-art automated essay evaluation systems.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Essays are short literary compositions on a particular subject (also referred to as prompt-specific essays), usually in prose and generally analytic, speculative, or interpretative in nature. Essays give students an opportunity to demonstrate their range of skills and knowledge, including higher-order thinking skills such as synthesis and analysis. Automated essay evaluation (AEE) is the process of evaluating and scoring written essays via computer programs [1]. For teachers and educational institutions, AEE represents not only a tool to assess learning outcomes, but also helps save time, effort and money without lowering quality. AEE systems can also be used in other application areas of text mining where the content of a text needs to be graded or prioritized, such as written applications, cover letters, scientific papers, e-mail classification, etc.

The field has been developing since the 1960s, when Ellis Batten Page and his colleagues [2] proposed the first automated essay scoring (AES) system. The system used basic measures to approximate features of interest and thus describe the quality of an essay. By the 1990s, progress in the field of natural language processing (NLP) encouraged researchers to apply new computational techniques to automatically extract essay writing quality measures. In the last decade, AEE became a well-established support technology in educational settings. Throughout the development of the field, several different names have been used for it interchangeably. The terms automated essay scoring (AES) and automated essay grading (AEG) were slowly replaced by the terms automated writing evaluation (AWE) and automated essay evaluation (AEE). The term evaluation within the name (AWE, AEE) came into use because the automated process enables students to receive constructive feedback about their writing.

Nowadays, AEE systems are used in combination with human graders in different high-stakes assessments such as the Graduate Record Examination (GRE), Test of English as a Foreign Language (TOEFL), Graduate Management Admissions Test (GMAT), SAT, American College Testing (ACT), Test of English for International Communication (TOEIC), Analytic Writing Assessment (AWA), No Child Left Behind (NCLB) and Pearson Test of English (PTE). Furthermore, some of them also act as the sole grader in low-stakes assessments and learning processes in classrooms.

The main weakness of existing AEE systems is that they consider text semantics only vaguely and focus mostly on syntax. Although the details of the majority of the systems have never been announced publicly, we can still deduce that they mostly perform syntax and shallow content measurements (calculating similarity between texts) and neglect the semantics. To analyze semantics, the state-of-the-art systems use latent semantic analysis (LSA) [3], latent Dirichlet allocation (LDA) [4], and content vector analysis (CVA) [5]. To measure the coherence of essays' content, LSA [6,7], random indexing [8], and an entity-based approach [9] have been used. However, only two existing systems [10,11] use approaches that partially check for consistency of


the statements in the essays. Despite these efforts, the latter systems are not automatic, as they require manual interventions from the user.

In this paper, we propose an extension to existing state-of-the-art AEE systems that incorporates additional novel attributes for measuring coherence (semantic development) and consistency of facts (compared to common sense knowledge and other facts in essays). The proposed coherence attributes measure distance, spatial patterns, and spatial autocorrelation between parts of the essay. The consistency attributes measure the number of semantic errors in a student essay using information extraction and logical reasoning. The most significant contribution of the proposed system is that by detecting semantic errors, the system is also able to provide semantic feedback about the essay. As discussed in many papers [12–14], the goal of AEE systems is no longer to accurately reproduce the human graders' scores, which are inconsistent in their grading, but to provide valid scores and consequently also immediate, accurate, and informative feedback. This feedback is important so that students can achieve progress and improve their writing. We compare our system (called SAGE - Semantic Automated Grader for Essays) with existing state-of-the-art systems and show that SAGE achieves significantly higher grading accuracy compared with 9 other state-of-the-art automated essay evaluation systems.

The paper is divided into seven sections. Section 2 describes several subfields of the related work that are relevant to our research. Section 3 describes the augmented essay evaluation system including the novel attributes, Section 4 presents the implementation and evaluation of the proposed system, and Section 5 the results of our system SAGE. Section 6 describes the semantic feedback and Section 7 draws conclusions.

2. Related work

The development of the AEE field has been carried out in different problem areas that we highlight in the following subsections.

2.1. Automated essay evaluation systems

The first AEE system was proposed almost 50 years ago. In 1966, the high school English teacher E. Page [2] proposed Project Essay Grade – the first automated system for grading student essays. He saw the system as a solution to reducing the hours spent manually grading student essays. Despite its impressive success at predicting teachers' essay ratings, the early version of the system received only limited acceptance in the writing and education community. The availability of the necessary tools (home computers, the Internet, computational techniques for automatically extracting measures of writing quality, ...) was poor, and society criticised the idea of displacing human graders [15]. The later widespread use of the Internet, word processing software and NLP accelerated the development of AEE systems.

In the past, one of the main obstacles to progress in this area was the lack of open-source AEE systems, which would allow insight into their grading methodology. Four commercial systems achieved predominance in this field: Project Essay Grade (PEG) [2], E-rater [16], Intelligent Essay Assessor (IEA) [17], and IntelliMetric [18]. In 2010, Mayfield and Rosé released LightSIDE [19], an automated evaluation engine with both compiled and source code publicly available. This program is designed as a tool for non-experts for a variety of purposes, including essay assessment. In order to compare the technical, methodological, and structural characteristics of the state-of-the-art systems, we present their comparison in Table 1 using the following main criteria:

• Type of attributes: The attributes describing the quality of an essay can roughly be divided into three groups: style, content and semantic attributes. Style attributes focus on lexical sophistication, grammar and mechanics (spelling, capitalization, and punctuation). Content attributes vaguely describe the semantics of an essay and are based on comparing an essay with the source text and other already graded essays. Semantic attributes are based on verifying the correctness of content meaning.
• Methodology: Different systems use various approaches to extract attributes from essays. The most widely used methodology is based on NLP. Systems focusing on content mostly use Latent Semantic Analysis (LSA) - a machine learning method that analyses related concepts between a set of documents and the terms they contain. LSA assumes that words with similar meaning occur in similar parts of text. To evaluate content, systems use pattern matching techniques (PMT) and extensions to LSA such as Generalized Latent Semantic Analysis (GLSA), which uses an n-gram-by-document matrix instead of a word-by-document matrix, and an improvement that considers semantics by means of syntactic and shallow semantic tree kernels [20]. For verifying the correctness and consistency of content, approaches such as (Open) Information Extraction ((O)IE), Semantic Networks (SN), Ontologies, Fuzzy Logic (FL), and Description Logic (DL) are used.
• Prediction model: The majority of the systems use machine learning algorithms (usually regression modeling) to predict the final grade. Alternatives are the Lexile measure - an estimate of a student's ability to express language in writing based on semantic complexity (level of expression) and syntactic sophistication (how the words are combined into sentences) - cosine similarity, and rule-based expert systems.

Table 1 provides a comparison of characteristics for the majority of AEE systems and approaches, including proprietary (non-public) systems, two publicly available systems, and approaches proposed by the academic community.

The majority of the systems in Table 1 were evaluated on a substantially large set of prompt-specific essays that were pre-scored by human expert graders. These datasets were divided into training and test sets. The training set was used to develop the scoring model of the AEE system. This scoring model was then used to assign scores to essays in the test set. The performance of the scoring model was validated by calculating how well the scoring model "replicated" the scores assigned by the human expert graders [24].

2.2. Measuring coherence of the text

Coherence is a concept that describes the flow of information from one part of discourse to another and ranges from lower-level cohesive elements, such as coreference, causal relationships, and connectives, up to higher-level elements that evaluate connections between the discourse and the reader's mental representation of it [7]. Existing systems measure coherence in noisy text with different supervised and unsupervised approaches. The unsupervised approaches usually measure lexical cohesion, i.e. repetition of words and phrases in an essay. Foltz et al. [6,7] assume that coherent texts contain a high number of semantically related words and measure coherence as a function of semantic relatedness between adjacent sentences. Relatedness can be computed using LSA without employing syntactic or other annotations. Hearst [35] subdivides texts into multi-paragraph units that represent subtopics and identifies patterns of lexical co-occurrence and distribution, i.e. identifying repetition of vocabulary across adjacent sentences.

The supervised learning approaches require annotated data (graded essays). They focus on occurrences of discourse elements (e.g. thesis statement, main idea, conclusion), entity sentence roles, grammar errors, and word usage. Miltsakaki and Kukich [36] have explored the role of centering theory [37] in locating topic shifts

Table 1
A comparison of the key features of the state-of-the-art AEE systems.

Types of attr.   | Methodology                 | Prediction model             | System
Style            | Statistical                 | Multiple linear regression   | PEG [21]
Style & Content  | NLP                         | Linear regression            | PS-ME [22]
Style & Content  | NLP                         | Linear regression            | e-rater [16]
Style & Content  | NLP                         | Multiple mathematical models | IntelliMetric [18]
Style & Content  | NLP                         | Neural networks              | Bookette [23]
Style & Content  | NLP                         | Neural networks              | OzEgrader [24]
Style & Content  | NLP                         | Machine learning             | CRASE [25]
Style & Content  | NLP                         | Statistical model            | AutoScore [26]
Style & Content  | NLP                         | Lexile measure               | Lexile [27]
Style & Content  | NLP                         | Learning to rank             | Ranked-based AEE [28]
Style & Content  | NLP                         | Ensemble classifiers         | Multi-classifier Fusion AEE [29]
Style & Content  | Statistical                 | Bayesian networks            | BETSY [30]
Style & Content  | Statistical                 | Linear regression            | SEAR [31]
Style & Content  | Statistical                 | Statistical                  | LightSIDE [32]
Content          | LSA, NLP                    | Machine learning             | IEA [17]
Content          | LSA, tree kernel functions  | Machine learning             | Semantic-tree-based AEE [20]
Content          | GLSA                        | Cosine similarity            | GLSA-based AEE [33]
Content          | NLP, PMT                    | Linear regression            | Markit [34]
Semantic         | FL, SN                      | Rule-based expert systems    | SAGrader [11]
Semantic         | OIE, DL                     | /                            | OBIE-based AEE [10]
Semantic         | OIE, NLP                    | Random forest                | SAGE

in student essays. Centering theory argues that the discourse in a text contains a set of textual segments, each containing discourse entities, which are then ranked by their importance. Topic shifts are generated by short-lived topics and are indicative of poor topic development. Higgins et al. [8] have developed a system that computes similarity across text segments based on their type of discourse element and semantic similarity (LSA). A support vector machine (SVM) uses these features to capture breakdowns in coherence due to relatedness to the essay question and relatedness between discourse elements. More recently, Burstein et al. [9,38] showed how the Barzilay and Lapata [39] algorithm can be applied to the domain of student essays. In Barzilay and Lapata's [39] approach, entities (nouns and pronouns) are represented by their sentence roles and the algorithm counts all possible entity transitions between adjacent sentences in the text. By combining those entity-based features with features related to grammar errors and word usage, Burstein and colleagues [38] improve the performance of automated coherence prediction for student essays.

2.3. Detecting semantic errors in an essay

Only two of the mentioned systems [10,11] partially check whether the statements in the essays are correct. SAGrader, developed by Brent [11], was the first AEE system that detected semantic information in an essay, and we based the architecture of our system upon it. For SAGrader, the teacher first specifies the assignment prompt and the desired features along with the relationships among them. Using fuzzy logic, the system recognizes word combinations that students may use, in order to detect the desired features and relationships. The desired knowledge in the form of a semantic network is then compared with the knowledge detected in a student's essay. The system scores the student's essay based on the similarities between the observed and desired knowledge using procedural rules. Detailed feedback indicates what the student did right and wrong [11].

Gutierrez et al. [10,40–42] later proposed a system that not only detects the desired (correct) knowledge but also detects incorrect knowledge using logic reasoning [10]. The system extracts statements using Open Information Extraction (OIE) and adds them to the domain ontology. The extracted tuples are in a form that is compatible with the OWL ontology. In the final step, the system determines the correctness of a statement through ontology-based consistency checking. If the domain ontology becomes inconsistent after the extracted sentence is added into it, then the sentence is incorrect with respect to the domain [10]. Despite many efforts, this system is still not fully automatic, as it requires manual inputs from the user.

2.4. Open information extraction

Information extraction is the task of automatically acquiring knowledge by transforming natural language text into structured information, such as a knowledge base [43]. The main tasks of information extraction are entity recognition, relation extraction, and coreference resolution. We focused on a tool for relation extraction called Open Information Extraction (OIE). Wu and Weld [44] define an OIE system as a function that maps an unstructured document text d to a set of triples, {<arg1, rel, arg2>}, where the args are noun phrases and rel is a textual fragment indicating an implicit, semantic relation between the two noun phrases. Unlike other relation extraction methods focused on a predefined set of target relations, the open information extraction paradigm is not limited to a small set of target relations known in advance, but extracts new types of relations found in the text. The main properties of OIE systems are domain independence, reliance on unsupervised extraction methods, and scalability to large amounts of text [45].

Gamallo [45] categorized the existing OIE systems into four groups. He first divided them into two broad categories: systems that require training data to learn a classifier and systems based on hand-crafted rules or heuristics. In addition, each of these categories can be divided into two further types: systems that use shallow syntactic analysis (e.g. part-of-speech tagging and/or chunking), and systems that use dependency parsing (transforming sentences into dependency trees).
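To make the triple format concrete, the following minimal sketch (our own illustration; the sentence and the extracted triple are invented and do not come from any of the OIE systems cited above) shows the <arg1, rel, arg2> form that the rest of the paper relies on:

# Minimal illustration of the OIE triple format <arg1, rel, arg2>.
# The sentence and the extracted triple are hypothetical examples,
# not output of any specific OIE system mentioned in the paper.
sentence = "The mitochondrion is the powerhouse of the cell."

# An OIE tool would return relation triples such as:
triples = [
    ("The mitochondrion", "is", "the powerhouse of the cell"),
]

for arg1, rel, arg2 in triples:
    print(f"<{arg1}, {rel}, {arg2}>")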

For implementing semantic consistency checking in our system, we used four different OIE systems. One of the systems, Open IE [46], belongs to the group that needs training data and uses shallow syntax. The other three systems, ClausIE [47], CSD-IE [48], and DepOE [49], belong to the group that relies on rules and uses dependency parsing. We describe their use in detail in Section 3.3.2.

We also used a system for entity recognition (Illinois Shallow Parser [50]) and two coreference resolution systems (Illinois Coreference Resolution [51] and Stanford Parser [52]) in our system for automated semantic error detection. We explain their application further in Section 3.3.2.

2.5. Ontology consistency

An ontology defines a set of representational primitives with which one can model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application [53]. A semantic network is a graphical notation for representing knowledge in patterns of interconnected nodes and arcs. To formally represent knowledge and use it in our system we use the Web Ontology Language (OWL) [54]. Description logic (DL) models concepts (e.g. Person), roles (e.g. isMarriedTo), individuals (e.g. alice, bob), and their relationships (e.g. alice : Person, (alice, bob) : isMarriedTo). The fundamental modeling concept of a DL is an axiom - a logical statement that relates roles and/or concepts using conjunction, disjunction, existential and value restrictions. Several reasoners exist for DL; one of them is HermiT [55]. The main function of a reasoner is to determine whether a given knowledge base (given as an ontology) is consistent. Reasoners detect classes that are unsatisfiable, i.e. classes for which a contradiction in the ontology implies that they cannot have any instances (OWL individuals). While OWL ontologies with unsatisfiable classes can still be used, inconsistency is a severe error: most OWL reasoners cannot infer any information from an inconsistent ontology. When faced with an inconsistent ontology, they will report this and abort the classification process.

In contrast to other systems, our proposed system (SAGE) focuses on automatic semantic evaluation and provides semantic feedback to students. It includes semantic attributes that measure coherence as a function of semantic relatedness through the entire essay and not only between adjacent sentences. It therefore differs from the approaches described above, which mainly focus on local coherence. Our system also analyzes text consistency by detecting entities in an essay and their relations, and by considering coreferences of concepts. SAGE exploits common sense knowledge ontologies and taxonomies, and can therefore work on different domains.

3. Semantic-aware essay evaluation system

The motivation for improving the current state-of-the-art systems is that semantic attributes might improve the grading accuracy. In this section, we present the existing common syntax (linguistic and content) attributes (Section 3.1), the proposed coherence attributes (Section 3.2), and the proposed consistency attributes (Section 3.3).

3.1. Existing syntax (linguistic and content) attributes

To implement the baseline AES system we used 72 different attributes that are mentioned in the related literature. This high number of attributes covers different aspects of each essay; for example, it has been shown that essay length significantly influences the human rater's score [56], and the length of an essay has the highest influence on the final score. Different readability measures and the variation of the used words also impact the final score.

To group similar syntax attributes, we divide them into two groups: linguistic and content attributes. The linguistic attributes describe lexical sophistication and the grammatical and mechanical aspects of the essay. These attributes are measured by counting all words, long words, different words, and the number of words with different part-of-speech (PoS) tags. More complex attributes measure the readability level [57,58], lexical diversity [59], and spellchecking/capitalization/punctuation errors. The second group are the content attributes, which are based on comparing unseen essays with graded ones. To extract these attributes, we first grouped similarly graded essays and then compared the lexical content of a new essay with the lexical content of the already graded essays.

All linguistic and content attributes that we implemented in our baseline AES system are presented in Table 2. Overall, we extracted 72 different linguistic and content attributes. For better presentation, we further divided the linguistic attributes into three subgroups (lexical sophistication, grammar, mechanics).
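As an illustration of how some of the simpler linguistic attributes in Table 2 can be obtained, the following sketch uses NLTK, on which the implementation in Section 4 is based. The helper name, the long-word threshold and the exact selection of attributes are our own assumptions, not the authors' code:

# A minimal sketch (not the authors' code) of extracting a few of the
# simpler linguistic attributes from Table 2 using NLTK.
import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag

def basic_linguistic_attributes(essay: str) -> dict:
    sentences = sent_tokenize(essay)
    words = [w for w in word_tokenize(essay) if w.isalpha()]
    tags = [tag for _, tag in pos_tag(words)]
    return {
        "num_characters": sum(len(w) for w in words),               # attr. 1
        "num_words": len(words),                                     # attr. 2
        "num_long_words": sum(len(w) >= 7 for w in words),           # attr. 3 (threshold assumed)
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),  # attr. 6
        "num_sentences": len(sentences),                              # attr. 7
        "num_different_words": len({w.lower() for w in words}),       # attr. 12
        "num_different_pos_tags": len(set(tags)),                     # attr. 29
    }

# Example usage (requires the NLTK tokenizer and tagger models):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# print(basic_linguistic_attributes("This is a short essay. It has two sentences."))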
3.2. Novel semantic coherence attributes

We base our coherence attributes on the assumption that the semantic content of a coherent essay changes gradually through its text, as has already been stated by Foltz [7]. We start by dividing essays into many sequential overlapping parts, obtained by moving a window through an essay in steps of 10 words (illustrated in Fig. 1). The window's size is defined so that it contains 25% of the average number of words per essay. For example, if the average essay length in a dataset was 280 words and the length of the essay was 320 words, we obtained 26 parts.

Fig. 1. Transformation of sequential overlapping essay parts into a high-dimensional semantic space.

For each essay corpus (dataset) we compute the term frequency-inverse document frequency (TF-IDF) representation, a numerical statistic that reflects how important a word is to a document in a corpus. The term frequency TF(t, d) is computed by counting the frequency of a term t in a document d. The inverse document frequency IDF(t, D) expresses how rare the term t is across all documents D:

TF-IDF(t, d, D) = TF(t, d) \cdot IDF(t, D) = \frac{|\{t \in d\}|}{|\{w \in d\}|} \cdot \log \frac{|\{d \in D\}|}{|\{d \in D : t \in d\}|}.   (1)

To compute the TF-IDF vectors of individual essay parts, we modify the computation of the TF-IDF to normalize the weights of words within each individual essay part with the word frequency of the entire essay. The TF-IDF vectors of essay parts represent points in a high-dimensional semantic space, which should be close to each other in coherent essays, according to our assumption. An example of an essay divided into parts that can be visualized as points in semantic space is illustrated in Fig. 1, in which the thin gray lines connect the points that represent the sequential parts of an essay. In Fig. 2 we use the same example to illustrate the definition of our semantic coherence attributes, which we explain in the following subsections.
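The following sketch illustrates the windowing and vectorization step under stated assumptions: it uses scikit-learn's standard TfidfVectorizer rather than the authors' modified TF-IDF weighting, and the corpus and essay strings are placeholders:

# Sketch: split an essay into overlapping windows and map each window to a
# TF-IDF vector. Uses scikit-learn's standard TF-IDF; the paper's variant
# additionally normalizes window weights by whole-essay word frequencies.
from sklearn.feature_extraction.text import TfidfVectorizer

def essay_windows(essay: str, window_size: int, step: int = 10):
    words = essay.split()
    return [" ".join(words[i:i + window_size])
            for i in range(0, max(len(words) - window_size, 0) + 1, step)]

corpus = ["first training essay ...", "second training essay ..."]  # graded essays
vectorizer = TfidfVectorizer().fit(corpus)

essay = "an ungraded essay whose coherence we want to describe ..."
avg_len = 280                      # average number of words per essay in the corpus
window_size = int(0.25 * avg_len)  # window covers 25% of the average essay length

parts = essay_windows(essay, window_size)
points = vectorizer.transform(parts)   # one high-dimensional point per essay part
print(points.shape)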

Table 2
Linguistic (lexical sophistication, grammar, mechanics) and content attributes.

Lexical sophistication:
1. Number of characters, 2. Number of words, 3. Number of long words, 4. Number of short words, 5. Most frequent word length, 6. Average word length, 7. Number of sentences, 8. Number of long sentences, 9. Number of short sentences, 10. Most frequent sentence length, 11. Average sentence length, 12. Number of different words, 13. Number of stopwords.
Readability measures [57,58]: 14. Gunning Fog index, 15. Flesch reading ease, 16. Flesch-Kincaid grade level, 17. Dale-Chall readability formula, 18. Automated readability index, 19. Simple measure of Gobbledygook, 20. LIX, 21. Word variation index, 22. Nominal ratio.
Lexical diversity [59]: 23. Type-token ratio, 24. Guiraud's index, 25. Yule's K, 26. The D estimate, 27. Hapax legomena (number of words occurring only once in a text), 28. Advanced Guiraud.

Grammar:
29. Number of different PoS tags, 30. Height of the tree representing the sentence structure, 31. Correct verb form, 32. Number of grammar errors.
Number of each PoS tag: 33. Coordinating conjunction, 34. Numeral, 35. Determiner, 36. Existential "there", 37. Preposition/subordinating conjunction, 38. Adjective, 39. Comparative adjective, 40. Superlative adjective, 41. Ordinal adjective or numeral, 42. Modal auxiliary, 43. Singular or mass common noun, 44. Plural common noun, 45. Singular proper noun, 46. Plural proper noun, 47. Preposition, 48. Participle, 49. Predeterminer, 50. Genitive marker, 51. Personal pronoun, 52. Possessive pronoun, 53. Adverb, 54. Comparative adverb, 55. Superlative adverb, 56. Particle ("to" as preposition or infinitive marker), 57. Verb - base form, 58. Verb - past tense, 59. Verb - gerund/present participle, 60. Verb - past participle, 61. Verb - 3rd person singular present, 62. Wh-determiner, 63. Wh-pronoun, 64. Wh-adverb.

Mechanics:
65. Number of spellchecking errors, 66. Number of capitalization errors, 67. Number of punctuation errors.

Content:
68. Cosine similarity with source text, 69. Score point level for maximum cosine similarity over all score points, 70. Cosine similarity with essays that have the highest score point level, 71. Pattern cosine [5], 72. Weighted sum of all cosine correlation values [5].

Fig. 2. Construction of semantic coherence attributes.

3.2.1. Basic coherence measures

The basic coherence measures capture the distances between parts of the essay, which are represented as points in the semantic space. We use two variants of each attribute in this group: one computed using the Euclidean distance metric and the other computed using the cosine similarity (in the following we use the term "distance" to interchangeably denote either of the two). The proposed attributes are:

- average distance between neighboring points in the semantic space (denoted by thin grey lines in Fig. 2a). Foltz [7] has already shown, by measuring cosine similarity between sentences in an essay, that highly coherent discourses have small movements in semantic space and vice versa. We defined similar attributes that describe the average distance between these points;
- minimum and maximum distance between neighboring points and their quotient;
- average distance between any two points, which measures how well an idea persists within the essay;

- maximum difference between any two points, which measures the diameter of the area that is covered with points and thus the breadth of the discussed concept in the space (illustrated in Fig. 2a);
- Clark and Evans' [60] distance to the nearest neighbor of each point in the semantic space, for measuring spatial relationships:

  R = \frac{\frac{1}{N}\sum_{i=1}^{N} r_i}{\frac{1}{2\sqrt{N}}} = \frac{2\sqrt{N}\sum_{i=1}^{N} r_i}{N}   (2)

  where r_i is the distance from a given point to its nearest neighbor (see Fig. 2b) and N is the number of points. It is a measure of the degree to which the observed distribution differs from random expectation with respect to the nearest neighbor [60];
- average distance to the nearest neighbor, which measures how fast an idea develops across an essay (see Fig. 2b);
- cumulative frequency distribution G of the nearest neighbors' distances:

  G(\bar{r}) = \frac{|\{r \le \bar{r}\}|}{N}   (3)

  where r is the distance from a given point to its nearest neighbor (see Fig. 2b), \bar{r} is the average distance to the nearest neighbor, and N is the number of points. The measure expresses the percentage of content deviations from the main idea.

3.2.2. Spatial data analysis

The second group of attributes describes the spatial characteristics of the data and aims to extract implicit knowledge such as spatial statistics and patterns. We adjusted a set of descriptive spatial statistics for use within our representation of the essays. The proposed attributes measure the central spatial tendency and the spatial dispersion and are defined as follows:

- average Euclidean distance between the centroid and each point, which measures the amount of dispersion in a point pattern (see Fig. 2c);
- minimal and maximal Euclidean distance between the centroid and each point and their coefficient, which measures the biggest content deviation from the main idea;
- standard distance (a spatial equivalent of the standard deviation), which measures the amount of absolute dispersion in a point pattern:

  SD = \sqrt{\frac{\sum_{k=1}^{n}\sum_{i=1}^{N} (D_i^k - D_c^k)^2}{N}}   (4)

  where D_i^k, k = 1, ..., n; i = 1, ..., N is the kth coordinate component of point i, D_c^k is the kth coordinate component of the mean center, n is the number of dimensions, and N is the number of points. Similar to the standard deviation, the standard distance is strongly influenced by extreme values. Because distances to the mean center are squared, atypical points have a dominant impact on the magnitude of this metric, which allows detecting deviating (incoherent) essay parts;
- relative distance, a descriptive measure of the relative spatial dispersion. We compute it by dividing the standard distance by a measure that describes the area covered with points:

  RD = \frac{SD}{d_{max}}   (5)

  where d_{max} is the maximum distance of any point from the centroid. This enables direct comparison of the dispersion of different point patterns from different areas, even if the areas are of varying sizes;
- the determinant of the distance matrix, a measure of spatial dispersion. This allows us to measure the dispersity of the content and consequently how broad the discussed topic is.

3.2.3. Spatial autocorrelation

Measures of spatial autocorrelation express how data tend to be clustered together in space (positive spatial autocorrelation) or dispersed (negative spatial autocorrelation). They enable us to detect global and local semantic coherence of the essays' content. If the essay exhibits positive spatial autocorrelation, this indicates that it is well structured and that the parts of the essay are well related to each other.

Typical measures of spatial autocorrelation are Moran's I [61], Geary's C [62], and Getis's G [63]. We adjusted these three measures so that we can use them in our high-dimensional semantic space, as follows:

- Moran's I assesses the overall clustering pattern. The original measure is intended for a 2-dimensional space; in this work, we adjust it to a high-dimensional semantic space by averaging it over dimensions:

  I = \frac{1}{n}\sum_{k=1}^{n} \frac{N}{S} \cdot \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(D_i^k - D_c^k)(D_j^k - D_c^k)}{\sum_{i=1}^{N} (D_i^k - D_c^k)^2}   (6)

  where D_i^k, k = 1, ..., n; i = 1, ..., N is the kth coordinate component of point i, D_c^k is the kth coordinate component of the mean center, n is the number of dimensions, N is the number of points, and S is the sum of all weights w_{ij}. Weights w_{ij} are assigned to every pair of points, with value w_{ij} = 1 if i and j are neighbors, and value w_{ij} = 0 otherwise. The range of I varies from -1 to +1. A positive sign of I indicates positive spatial autocorrelation and means that neighboring points cluster together, while the opposite is true for a negative sign. Values close to zero indicate complete spatial randomness.
- Geary's C is inversely related to Moran's I. In this case, the interaction is not a cross-product of the deviations from the mean, but the deviations in intensities of each observation location from one another. Again, our adjusted measure is calculated in the high-dimensional semantic space and is averaged over all dimensions:

  C = \frac{1}{n}\sum_{k=1}^{n} \frac{(N-1)\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(D_i^k - D_j^k)^2}{2\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(D_i^k - D_c^k)^2}   (7)

  where D_i^k, D_c^k, n, N and the weights w_{ij} are as described above.
- Getis's G enables us to examine point patterns at a more local scale. Getis's G measures the overall concentration, or lack of concentration, of all pairs of values (D_i, D_j) such that i and j are within distance d of each other. We adjusted the measure for use in a high-dimensional space:

  G(d) = \frac{1}{n}\sum_{k=1}^{n} \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}(d)\,D_i^k D_j^k}{\sum_{i=1}^{N}\sum_{j=1}^{N} D_i^k D_j^k}   (8)

  where D_i^k, n and N are as above, and d is the average distance between any two points in the semantic space. A weighting function w_{ij}(d) is used to assign binary weights to every pair of points, where w_{ij}(d) = 1 if i and j are within distance d, and w_{ij}(d) = 0 otherwise (illustrated in Fig. 2d).

As a result, we extracted 29 different semantic attributes, which are listed in Table 3. We combined these attributes with the 72 existing syntactic attributes and evaluated their benefit in Section 5.1.
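As a rough illustration, the sketch below computes a few of the attributes listed above (average neighboring distance, Clark and Evans' R from Eq. (2), and the dimension-averaged Moran's I from Eq. (6)) from the window vectors of Section 3.2. It is our own simplification, not the authors' implementation: in particular, the neighborhood weights w_ij are assumed to connect consecutive windows only.

# Sketch: a few coherence attributes from Table 3 computed over the TF-IDF
# points of one essay. "Neighbors" for the weights w_ij are taken to be
# consecutive windows, which is an assumption made for this example.
import numpy as np

def coherence_attributes(points: np.ndarray) -> dict:
    N, n = points.shape                              # N windows, n dimensions
    neighbor_dist = np.linalg.norm(np.diff(points, axis=0), axis=1)

    # Clark and Evans' R, Eq. (2): nearest-neighbor distances vs. expectation.
    pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)
    r = pairwise.min(axis=1)
    R = r.mean() / (1.0 / (2.0 * np.sqrt(N)))

    # Moran's I, Eq. (6), averaged over dimensions; w_ij = 1 for consecutive windows.
    W = np.zeros((N, N))
    idx = np.arange(N - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    S = W.sum()
    centered = points - points.mean(axis=0)
    I_per_dim = []
    for k in range(n):
        dk = centered[:, k]
        denom = (dk ** 2).sum()
        if denom > 0:
            I_per_dim.append((N / S) * (W * np.outer(dk, dk)).sum() / denom)
    moran_I = float(np.mean(I_per_dim)) if I_per_dim else 0.0

    return {
        "avg_neighbor_distance": float(neighbor_dist.mean()),
        "max_any_two_points": float(pairwise[np.isfinite(pairwise)].max()),
        "clark_evans_R": float(R),
        "moran_I": moran_I,
    }

# Usage with the windows from the previous sketch:
# points = vectorizer.transform(parts).toarray()
# print(coherence_attributes(points))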

Table 3
List of novel semantic attributes. The attributes denoted with "(2x)" were computed twice, once using the Euclidean distance and once using the cosine similarity.

Basic coherence measures:
1–2. Average distance between neighboring points (2x)
3–4. Minimum distance between neighboring points (2x)
5–6. Maximum distance between neighboring points (2x)
7–8. Index (minimum distance/maximum distance) (2x)
9–10. Average distance between any two points (2x)
11–12. Maximum difference between any two points (2x)
13. Clark's and Evans' distance to nearest neighbor
14. Average distance to nearest neighbor
15. Cumulative frequency distribution

Spatial data analysis:
16–17. Average distance between points and centroid (2x)
18–19. Minimum distance between points and centroid (2x)
20–21. Maximum distance between points and centroid (2x)
22–23. Index (minimum distance/maximum distance) (2x)
24. Standard distance
25. Relative distance
26. Determinant of distance matrix

Spatial autocorrelation:
27. Moran's I
28. Geary's C
29. Getis's G

3.3. Automated error detection system and novel semantic correctness attributes

In recent years, many researchers have argued [12–14] that accurately reproducing the human graders' scores is no longer the main goal of AEE systems. It is desirable that AEE systems can recognize certain types of errors, including syntactic errors, and offer automated feedback on correcting these errors. In addition, the systems should also provide global feedback on content and development. The current limitation of such feedback is that its content is restricted to the syntactic aspect of the essay while neglecting the semantic aspects. Exceptions are the systems [10,11] that include semantic evaluation of the content, but these are not automatic.

In this paper, we propose a fully automatic system that discovers semantic errors and provides comprehensive feedback. The logic of the proposed Automatic Error Detection (AED) system is illustrated in Fig. 3 and described in Algorithm 1. The system starts by constructing an ontology based on common sense knowledge (everyday universal facts) and supplements it using a source text (facts in the text about which the students need to write), domain knowledge (facts about a specific domain) and target knowledge (additional facts about the knowledge that the students are required to show). At this point we use entity recognition, coreference resolution and open information extraction to join these components into a common ontology. The result of open information extraction are triples {<arg1, rel, arg2>} that describe a relation rel between arguments (subjects or objects) arg1 and arg2. The system afterwards proceeds by iteratively adding extractions into the ontology and using the HermiT logical reasoner to determine whether the ontology is consistent after adding each extraction. If the consistency-checking algorithm finds a contradiction in the ontology, it reports a discovered consistency error and includes it in the final feedback.

Fig. 3. Automatic Error Detection (AED) system. The ontology-building (left) part automatically builds an ontology by combining different ontologies into a base ontology. The extraction (right) part extracts all possible extractions using entity recognition (ER), coreference resolution (CR) and open information extraction (IE) tools. In the final step, the AED system merges extractions, one by one, with the base ontology. If the logical reasoner determines that a new extraction is inconsistent with the base ontology, it reports a semantic error.

In the following subsections we describe each part of our system, illustrated in Fig. 3, in detail.

3.3.1. Construction of the base ontology

Importing the common sense ontology. The system starts building the base ontology with an ontology that contains common sense knowledge (also referred to as an upper ontology). We use the Common Semantic Model (COSMO) ontology [64] (http://www.micra.com/COSMO/), which is made up of a lattice of ontologies that serve as a set of basic logically-specified elements (classes, relations, functions, instances). The ontology is derived from elements in the public ontologies OpenCyc (http://www.opencyc.org/), SUMO (http://www.ontologyportal.org/), BFO (http://www.ifomis.uni-saarland.de/bfo/) and DOLCE (http://www.loa-cnr.it/DOLCE.html). The COSMO ontology serves as a foundation ontology that has enough fundamental concept representations to translate assertions from different ontologies into a common terminology and format. The COSMO ontology is in the OWL format and contains inference rules in the form of subclass and subproperty relations and restrictions [64].

We use the WordNet taxonomy [65] to add synonyms (gathered in synsets) and hypernyms to our ontology. We proceed by supplementing the COSMO ontology with the following:

1. Source text knowledge: Our system extracts the knowledge of the source text, upon which the essay subject is based. It processes it using the steps described in Sections 3.3.1 and 3.3.3. If the ontology becomes inconsistent after a new extraction is added, we detect the error and disregard the extraction.

2. Domain knowledge: In addition to the source text knowledge, the system supplements the base ontology with domain knowledge that covers the wider scope of the essay, in the form of an ontology, including synonyms and hypernyms. For example, if students write an essay about genes and biology, we add the Gene Ontology (http://geneontology.org/).
3. Target knowledge: The source text and the domain knowledge represent knowledge about a specific domain. Professors or assessors can add specific desired knowledge which they explicitly expect the students to express in an essay. The presence of the target knowledge in an essay can have an important role when grading an essay. Detection of this knowledge can increase the accuracy of grading and improve the feedback quality.
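Because WordNet synonyms and hypernyms are used both here and when matching extracted entities against the ontology (Section 3.3.3), a minimal lookup sketch with NLTK's WordNet interface is shown below; the helper function is ours and only hints at how such lookups could feed the ontology:

# Sketch: WordNet synonym and hypernym lookup of the kind the AED system
# relies on when it cannot find an extracted entity in the ontology.
# Requires the WordNet corpus: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def synonyms_and_hypernyms(term: str):
    synonyms, hypernyms = set(), set()
    for synset in wn.synsets(term):
        synonyms.update(lemma.name() for lemma in synset.lemmas())
        for hyper in synset.hypernyms():
            hypernyms.update(lemma.name() for lemma in hyper.lemmas())
    return synonyms, hypernyms

syns, hypers = synonyms_and_hypernyms("gene")
print(sorted(syns))    # candidate synonyms to match against ontology classes
print(sorted(hypers))  # hypernyms that can be used to create a subclassOf relation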
3.3.2. Processing of the ungraded essay

Preprocessing. In the preprocessing phase the system first reads an essay and breaks it into sentences. Then it creates a duplicate of each sentence and performs several preprocessing steps: tokenization; part-of-speech tagging; finding and labeling stopwords, punctuation marks, determiners and prepositions; transformation to lower case; and stemming.

Entity recognition. Shallow parsing is the process of identifying syntactic phrases in natural language sentences. A shallow parser identifies several kinds of phrases (chunks) that are derived from parse trees, i.e. noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adverb phrase (ADVP), and clause introduced by a subordinating conjunction (SBAR). These chunks provide an intermediate step to natural language understanding. Although identifying whole parse trees can provide deeper analyses of the sentences, it is a much harder problem [50]. Our system uses the Illinois Shallow Parser [50] to determine chunks, which we later use for coreference resolution, for finding a suitable chunk when connecting extractions with parts of sentences, and for matching chunks with individuals, classes and relations within the ontology.

Coreference resolution. A given entity - representing a person, a location, or an organization - can be mentioned in text in multiple, ambiguous ways. Understanding natural language and supporting intelligent access to textual information requires identifying whether different entity mentions are actually referencing the same entity [51]. The coreference resolution processes unannotated essay text and shows which mentions are coreferential. Our system uses two different coreference resolution systems (the Illinois Coreference Resolution [66] and the Stanford Parser [67]) to detect coreferences in an essay and uses them when adding extractions to the ontology. The system combines the coreferences discovered by both systems and thus increases the accuracy of the discovered coreferences.

Open information extraction. After the above phases, the system performs information extraction using four systems and returns triples, as described in Section 2.4. Within this process, duplicate extractions are removed, as well as faulty extractions (e.g. those consisting only of a subject and a relation, while the object is missing). After all of the previously described phases, the system starts to process each sentence sequentially and adds each extraction to the ontology by utilizing the logic reasoner.

3.3.3. Logic reasoner

After obtaining the base ontology and the extractions from the essay, we can start discovering semantic errors. To achieve this, we use the HermiT logic reasoner [55] (described in Section 2.5), as shown in Algorithm 1. If the ontology is inconsistent or has an unsatisfiable concept after adding an extraction, the system concludes that there is a semantic error in the essay. The system remembers where the error occurred to later provide detailed feedback. The system then deletes the relation from the ontology and continues with the next extraction. Whenever an extraction is processed, the system first looks for both entities (predicates) in the ontology. If the ontology does not yet contain any of them, the WordNet [65] taxonomy is used to find synonyms or coreferenced entities within the ontology. If the ontology still does not contain any synonyms or coreferences, the system looks for hypernyms of the entity and creates a subclass (i.e. creating a triplet with a subclassOf relation). If all described attempts fail, the last alternative is to create a new class or individual in the ontology (see Section 2.5). When both entities are included in the ontology, the system first checks whether the specific relation is not yet part of the ontology and adds it accordingly.

Algorithm 1. Automated error detection (AED) system.

Input: common sense ontology, domain knowledge, target knowledge, source text, ungraded essay
Output: error detections

function main(common_sense_ontology, domain_knowledge, target_knowledge, source_text, ungraded_essay)
    function extract(text, ontology)
        Preprocessing
        Entity recognition
        Coreference resolution
        Open information extraction
        for sentence in text do
            for relation in sentenceRelations do
                Add to ontology
                HermiT check
                print errors
            end for
        end for
        return (ontology, errors)
    end function

    ontology ← common_sense_ontology
    (ontology, _) ← extract(source_text, ontology)
    ontology ← ontology + domain_knowledge
    ontology ← ontology + target_knowledge
    (_, errors) ← extract(ungraded_essay, ontology)
    return errors
end function
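To make the control flow of Algorithm 1 concrete, the toy sketch below replaces the OWL ontology and the HermiT reasoner with a set of triples and a single hard-coded contradiction rule. It is only meant to illustrate the add-check-report-remove loop, not the actual OWL reasoning:

# Toy sketch of the consistency-checking loop in Algorithm 1. The real AED
# system uses OWL ontologies and the HermiT reasoner; here the "ontology" is
# just a set of triples and the "reasoner" only knows one kind of contradiction
# (explicitly disjoint classes), which is enough to illustrate the control flow.
DISJOINT = {("Mammal", "Fish"), ("Fish", "Mammal")}

def is_consistent(ontology):
    types = {}
    for subj, rel, obj in ontology:
        if rel == "isA":
            types.setdefault(subj, set()).add(obj)
    return not any((a, b) in DISJOINT
                   for classes in types.values()
                   for a in classes for b in classes)

def detect_errors(base_ontology, extractions):
    ontology, errors = set(base_ontology), []
    for triple in extractions:
        ontology.add(triple)
        if not is_consistent(ontology):      # "HermiT check" in Algorithm 1
            errors.append(triple)            # remember the error for feedback
            ontology.remove(triple)          # continue with the next extraction
    return errors

base = {("Whale", "isA", "Mammal")}
essay_extractions = [("Whale", "livesIn", "Ocean"), ("Whale", "isA", "Fish")]
print(detect_errors(base, essay_extractions))  # -> [("Whale", "isA", "Fish")]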
3.3.4. Semantic error detection attributes

Based on the semantic errors in an essay detected by the logic reasoner (as explained in the previous section), we implemented three error attributes:

1. the number of unsatisfiable cases when adding classes and individuals to the ontology,
2. the number of inconsistency errors after adding a triple to the ontology,
3. the total number of consistency errors (the sum of the first two attributes).

In the remaining sections we proceed to evaluate the benefits of the proposed attributes and to present the results.

Table 4
Properties of essay datasets.

Characteristic              DS1        DS2          DS3          DS4          DS5          DS6          DS7        DS8
Type of essay               Persuasive Persuasive   Source-based Source-based Source-based Source-based Expository Narrative

Training set
  Number of essays          1783       1800         1726         1771         1805         1800         1569       723
  Mean number of words      366.40     381.19       108.69       94.39        122.29       153.64       171.28     622.13
  SD of number of words     120.40     156.44       53.3         51.68        57.37        55.92        85.2       197.08
  Range of grades           2–12       1–6 / 1–4    0–3          0–3          0–4          0–4          0–24       0–60
  Mean grade                8.53       3.42 / 3.33  1.85         1.43         2.41         2.72         19.98      37.23

Test set
  Number of essays          589        600          568          586          601          600          441        233
  Mean number of words      368.96     378.4        113.24       98.7         127.17       152.28       173.48     639.05
  SD of number of words     117.99     156.82       56.0         53.84        57.59        52.81        84.52      190.13
  Range of grades           2–12       1–6 / 1–4    0–3          0–3          0–4          0–4          0–24       0–60
  Mean grade                8.62       3.41 / 3.32  1.9          1.51         2.51         2.75         20.13      36.67

SD = standard deviation. For DS2, the two values in the grade rows correspond to the two scoring criteria (datasets 2a / 2b).

4. Implementation and evaluation

We extracted the proposed attributes from the text by using the Natural Language Toolkit (NLTK) [68] for natural language processing in Python and the spellchecking library PyEnchant (https://pythonhosted.org/pyenchant/).

4.1. Essay datasets

We performed the experiments on datasets that were provided within the Automated Essay Scoring competition on the Kaggle website (access to the data can be requested through http://www.kaggle.com/c/asap-aes/data or the ASAP website http://www.scoreright.org/). The datasets contain student essays for eight different prompts (essay discussion questions). The anonymized students were from the USA and were drawn from three different grade levels: 7, 8, and 10 (aged 12, 13, and 15, respectively). Four datasets included essays of traditional writing genres (persuasive, expository, narrative) and the other four were source-based (i.e., the students had to discuss questions referring to a previously read source document). Each training set was pre-scored by at least two human expert graders. Since Dataset 2 was scored using two different criteria, it appears as two separate datasets, 2a (scored with an emphasis on writing applications) and 2b (scored with an emphasis on language skills), in the tables with the results.

The authors of the datasets already divided them into fixed training and test sets. We used the same training and test sets to build scoring models and measure prediction accuracy, respectively. The characteristics of the used datasets are shown in Table 4.

4.2. Evaluation

For evaluating the performance of the prediction models we used the following measures:

- the exact agreement measure, defined as the percentage of essays that were graded equally by the human grader and the AEE system;
- the quadratic weighted Kappa, an error metric that measures the degree of agreement between two graders (in the case of AEE this is the agreement between the automated scores and the resolved human scores) and is an analogy to the correlation coefficient. This metric typically ranges from 0 (random agreement between graders) to 1 (complete agreement between graders). In case there is less agreement between the graders than expected by chance, the metric may go below 0. Assuming that a set of essay responses E has S different possible ratings, 1, 2, ..., S, and that each essay received scores by two different graders (e.g. human/computer), the metric is calculated as follows:

  \kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}   (9)

  where w are weights, O is a matrix of observed ratings and E is a matrix of expected ratings. The matrix of weights w_{i,j} is an S-by-S matrix that is calculated based on the difference between the graders' scores, such that

  w_{i,j} = \frac{(i - j)^2}{(S - 1)^2}.   (10)

  The matrix of observed ratings O, which is an S-by-S histogram (agreement) matrix, is constructed over the essay ratings, such that O_{i,j} corresponds to the number of essays that received a rating i by grader A and a rating j by grader B; analogously, E is an S-by-S histogram matrix of expected ratings:

  E_{i,j} = \frac{H_{A_i} \cdot H_{B_j}}{N}   (11)

  where H_{A_i}, i = 1, ..., S denotes the number of essays that grader A scored with score i, and N is the number of graded essays. E is normalized with N such that E and O have the same sum.

To compare the significance of the difference between two quadratic weighted Kappas, we used the Wilcoxon signed-rank test. This is a non-parametric statistical test that is used when comparing repeated measurements on a single sample to assess whether their population mean ranks differ. The test assumes that the data are paired, come from the same population and are not necessarily normally distributed [69].

Table 5
Comparison of different regression models: linear regression (LR), regression trees (RT), neural network (NN), random forest
(RF), and extremely randomized trees (ERT).

Model DS1 DS2a DS2b DS3 DS4 DS5 DS6 DS7 DS8 Average

LR 0.8359 0.7232 0.5175 0.6535 0.7090 0.7900 0.7663 0.7781 0.7785 0.7280
RT 0.8070 0.6943 0.4885 0.6620 0.7113 0.7828 0.7184 0.7323 0.7289 0.7028
NN 0.8247 0.6964 0.4883 0.6328 0.6877 0.7776 0.7428 0.7601 0.7247 0.7039
RF 0.8447 0.7389 0.5386 0.6591 0.7174 0.7949 0.7636 0.7888 0.7738 0.7356
ERT 0.8434 0.7439 0.5384 0.6554 0.7148 0.7967 0.7670 0.7882 0.7807 0.7365
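The quadratic weighted Kappa values reported in Table 5 follow Eqs. (9)-(11) from Section 4.2. The sketch below is our own NumPy implementation of the metric, not the authors' evaluation code; scores are assumed to be integers in the range 1..S:

# Sketch (following Eqs. (9)-(11)): quadratic weighted Kappa between two
# sets of integer essay scores in the range 1..S.
import numpy as np

def quadratic_weighted_kappa(scores_a, scores_b, S):
    O = np.zeros((S, S))
    for a, b in zip(scores_a, scores_b):
        O[a - 1, b - 1] += 1                      # observed agreement histogram

    hist_a = O.sum(axis=1)                        # H_A: score histogram of grader A
    hist_b = O.sum(axis=0)                        # H_B: score histogram of grader B
    E = np.outer(hist_a, hist_b) / len(scores_a)  # expected ratings, Eq. (11)

    i, j = np.indices((S, S))
    w = (i - j) ** 2 / (S - 1) ** 2               # weights, Eq. (10)

    return 1.0 - (w * O).sum() / (w * E).sum()    # Eq. (9)

# Example: two graders scoring five essays on a 1..6 scale.
print(quadratic_weighted_kappa([2, 3, 4, 4, 6], [2, 3, 5, 4, 6], S=6))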

- Extremely randomized trees: a model similar to random for- tion rubric also describes the coherence, we decided to further in-
est, but it uses the same data to train all trees in a set and vestigate how well our proposed coherence attributes predict the
chooses splitting nodes randomly among variables. We used the organization rubric score. In this experiment, we prepared datasets
“extraTrees” package10 in R; the ensemble contained 100 trees; with three different sets of attributes: (a) coherence attributes only
the number of attributes tried at each node was |at t ributes
3
|
; the (29), (b) syntax (linguistic and content) attributes only (72), and
number of random cuts for each (randomly chosen) attribute (c) syntax and coherence attributes (101). Table 7 shows the re-
was 1 (default), which corresponds to the official ExtraTrees sults using quadratic weighted Kappa and p-values. Based on the
method; cutting thresholds are uniformly sampled. high influence of the number of characters and words on the fi-
Since both models continued to achieve similar results in the following experiments, we report only the results for the random forest model.

5. Results

To analyze the potential benefits of the proposed attributes, we first evaluated their relevance and contribution to predictive accuracy. We proceed by comparing the predictive accuracy of three different versions of our AEE system and continue by comparing the best system to other state-of-the-art systems.

5.1. Evaluation of the implemented attributes

As described in Section 3, we implemented 104 existing and novel attributes (72 linguistic and content attributes, 29 coherence attributes, and 3 consistency attributes). To improve model interpretability, achieve shorter training times and enhance generalization by reducing overfitting, we performed attribute selection to detect redundant and irrelevant attributes. Attribute selection was performed using the forward attribute selection approach: starting with an empty set of attributes, the attribute that improves the model performance the most (measured by the quadratic weighted Kappa measure) was included into the set in each iterative step. The procedure was terminated when there were no more attributes that improved the model performance. The model performance was measured using internal ten-fold cross-validation, and the best attribute in each step was selected according to the highest average Kappa value among all folds.
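A minimal sketch of this greedy forward selection is given below. The authors' implementation is not public, so the model choice (a random forest classifier), the function names, and the use of scikit-learn's quadratic weighted Kappa are illustrative assumptions rather than the actual code.

```python
# Sketch of the greedy forward attribute selection described above: in each
# step the attribute that most improves the ten-fold cross-validated quadratic
# weighted Kappa is added, until no attribute improves the model further.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold

def cv_kappa(X, y, attrs, n_splits=10):
    """Average quadratic weighted Kappa of a model restricted to `attrs`."""
    kappas = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[train][:, attrs], y[train])
        pred = model.predict(X[test][:, attrs])
        kappas.append(cohen_kappa_score(y[test], pred, weights="quadratic"))
    return float(np.mean(kappas))

def forward_selection(X, y):
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {a: cv_kappa(X, y, selected + [a]) for a in remaining}
        best_attr = max(scores, key=scores.get)
        if scores[best_attr] <= best:   # no attribute improves the model
            break
        selected.append(best_attr)
        remaining.remove(best_attr)
        best = scores[best_attr]
    return selected, best
```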
The ranks of the 50 most relevant attributes, averaged across all data sets, are shown in Table 6 in decreasing order of relevance (increasing average rank). From the ranking we can see that the number of characters and the number of words influence the final grade the most, as does the score point level that uses cosine similarity between already graded essays and a new essay. We can observe that some of the proposed coherence attributes rank high among the other attributes: Getis's G (13th), determinant of distance matrix (16th), minimum distance between neighboring points (Euclid) (21st), relative distance (22nd), Clark's and Evans' distance to nearest neighbor (23rd), and Moran's I (24th). The highest-ranked of the proposed consistency attributes, the sum of consistency errors, is in 37th place.

Dataset 8, in addition to the final score, provides 6 rubric scores describing ideas and content, organization, voice, word choice, sentence fluency, and convention for each essay. Since the organization rubric also describes coherence, we decided to further investigate how well our proposed coherence attributes predict the organization rubric score. In this experiment, we prepared datasets with three different sets of attributes: (a) coherence attributes only (29), (b) syntax (linguistic and content) attributes only (72), and (c) syntax and coherence attributes (101). Table 7 shows the results using quadratic weighted Kappa and p-values. Based on the high influence of the number of characters and words on the final grade, we expected high prediction accuracy already when using the set of syntax attributes only. Nevertheless, by adding the set of coherence attributes to the set of syntax attributes, the accuracy additionally increased. We can also see that the set of coherence attributes alone achieved relatively high prediction accuracy, which enabled us to conclude in favor of their benefit.

We additionally calculated the Spearman coefficients between the coherence attributes and the organization rubric score. Getis's G and Moran's I achieved the highest absolute correlations, with 0.5947 and 0.5752, respectively (p-values < 0.001). Overall, 19 of the 29 coherence attributes correlate with the organization rubric score with a p-value smaller than 0.05.
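The correlation analysis itself is a one-line computation; a sketch with SciPy is shown below. The attribute values and rubric scores in the arrays are dummy placeholders, not the study's data.

```python
# Sketch: Spearman correlation between a coherence attribute (e.g. Getis's G)
# and the organization rubric score; the arrays below are dummy placeholders.
import numpy as np
from scipy.stats import spearmanr

getis_g = np.array([0.12, 0.35, 0.28, 0.51, 0.44, 0.30, 0.47, 0.22])
organization_score = np.array([2, 4, 3, 6, 5, 4, 5, 3])

rho, p_value = spearmanr(getis_g, organization_score)
print(f"Spearman rho = {rho:.4f}, p = {p_value:.4g}")
```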
5.2. Accuracy of the semantic-based AEE system

In our third experiment, we compared three versions of our system to evaluate whether the semantic attributes yield better model performance:

1. AGE: the system with only linguistic and content attributes (described in Section 3.1),
2. AGE+: system AGE, augmented with additional coherence attributes (described in Section 3.2),
3. SAGE: system AGE+, augmented with additional consistency attributes (described in Section 3.3).

Since the system SAGE requires source-based essays to build an ontology for the logic reasoner, we were able to evaluate it only on datasets that include source-based essays (datasets 3, 4, 5, and 6).

Table 8 shows the quadratic weighted Kappas and exact agreement for AGE and AGE+. The results show that the prediction accuracy improves significantly (p-value < 0.05) on 8 out of 9 datasets when the coherence attributes are used in the system. The comparison of the average results in the rightmost column of Table 8 shows that there is also a significant difference between the two systems over all datasets.

Table 9 shows that the consistency attributes helped achieve higher Kappa values on all four observed datasets. However, the improvements were significant only on 2 out of 4 datasets, and not on the average (the rightmost column). We nevertheless argue (in Section 6) that SAGE contributes another valuable benefit: feedback for students.
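For reference, the quadratic weighted Kappa used for these comparisons is the standard weighted Kappa with quadratic weights; with N score categories, observed rating-pair frequencies O_ij and expected frequencies E_ij (derived from the two raters' marginal score distributions), it is commonly written as follows (this is the standard definition, not a formula taken from the paper):

```latex
w_{ij} = \frac{(i-j)^2}{(N-1)^2}, \qquad
\kappa_{qw} = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}
```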



Table 6
Average ranks of 50 most relevant attributes within all 104 attributes. The ranks were obtained using the forward attribute selection.

Attribute Average rank

1. Number of characters 1.3333


2. Number of words 16.4444
3. Score point level for maximum cosine similarity over all score points 19.7778
4. Number of superlative adjectives 20.7778
5. Number of predeterminers 23.1111
6. Average word length 25.4444
7. Number of existential there’s 26.2222
8. Number of genitive markers 29.4444
9. Number of superlative adverbs 30.0000
10. Most frequent word length 30.3333
11. Number of coordinating conjunctions 35.0000
12. Number of adjectives 35.5556
13. Getis's G 35.8889
14. Number of comparative adjectives 36.0000
15. Number of modal auxiliaries 36.2222
16. Determinant of distance matrix 36.6667
17. Number of comparative adverbs 39.3333
18. Number of short sentences 40.2222
19. Number of different PoS tags 40.5556
20. Correct verb form 40.6667
21. Minimum distance between neighboring points (Euclid) 42.1111
22. Relative distance 42.5556
23. Clark’s and Evans’ distance to nearest neighbor 42.5556
24. Moran’s I 43.3333
25. Number of determiners 43.6667
26. Number of adverbs 43.8889
27. Number of wh-determiners 43.8889
28. Cumulative frequency distribution 44.3333
29. Number of capitalization errors 45.2222
30. Number of short words 45.5556
31. Index (minimum distance/maximum distance) (Cos) 46.3333
32. Number of spellchecking errors 46.7778
33. Average distance between any two points (Cos) 46.8889
34. Index (minimum distance/maximum distance) (Euclid) 47.1111
35. Maximum distance between neighboring points (Euclid) 47.5556
36. Index (minimum distance/maximum distance) (Cos) 47.7778
37. Sum of consistency errors 48.2222
38. Number of wh-pronouns 48.2222
39. Maximum distance between neighboring points (Cos) 48.7778
40. Average distance between any two points (Euclid) 49.1111
41. Number of possessive pronouns 49.5556
42. Number of preposition/subordinating conjunctions 49.6667
43. Unsatisfiable count - semantic error attribute 49.8889
44. Number of verbs - base form 49.8889
45. Simple measure of Gobbledygook 50.2222
46. Maximum distance between points and centroid (Euclid) 50.2222
47. Number of wh-adverbs 50.3333
48. Maximum difference between any two points (Cos) 51.3333
49. Minimum distance between points and centroid (Cos) 51.6667
50. Yule’s K 52.4444

Cos = cosine similarity; Euclid = Euclidean distance.
Table 7
Comparison of the prediction accuracy for the organization rubric, which also measures coherence. Prediction models were built using three
different sets of attributes: (a) coherence attributes, (b) linguistic and content attributes, and (c) linguistic, content, and coherence attributes.
The left table contains the quadratic weighted Kappas and shows the accuracy for each model. The right table shows the significant difference
between the models using p-values.

5.3. Comparison with the state-of-the-art AEE systems

We also compared the proposed system SAGE with the state-of-the-art systems that were used in a previous study [26] at the end of 2012 (PEG, e-rater, IntelliMetric, CRASE, LightSIDE, AutoScore, IEA, Bookette, Lexile Writing Analyzer), with a ranked-based approach [28], and with the results obtained by researchers participating in the aforementioned Automated Essay Scoring competition on the Kaggle website. The eight commercial systems among the above listed systems capture over 97% of the current automated scoring market in the USA [26].

Tables 10 and 11 show the results that were calculated between the automated and human scores (the resolved score of multiple human graders). Since not all the systems are available for public experimenting, their results were obtained from the papers [26] and [28], and from the Kaggle website (http://www.kaggle.com/c/asap-aes/data).

Table 8
Comparison of the system AGE (syntactic attributes only) and the system AGE+ (with additional coherence attributes) using quadratic weighted Kappa and exact agreement; the p-values are computed for the Kappas.

System  Metric       DS1      DS2a     DS2b     DS3      DS4      DS5      DS6      DS7      DS8      Average
AGE     QW Kappa     0.9045   0.7473   0.6619   0.8096   0.8040   0.8701   0.7736   0.8760   0.7851   0.8036
        Exact agg.   0.7224   0.7716   0.7379   0.7886   0.7237   0.7847   0.7314   0.2607   0.2219   0.6381
AGE+    QW Kappa     0.9251   0.7924   0.6714   0.8272   0.8109   0.8729   0.7817   0.8814   0.8050   0.8187
        Exact agg.   0.7507   0.8057   0.7481   0.8036   0.7375   0.7805   0.7400   0.2627   0.1577   0.6430
p-value (Kappa)      <0.001a  <0.001a  0.0416a  0.0116a  0.0398a  0.0205a  <0.001a  0.0201a  0.0570   0.0083a

a: p-value < 0.05

Table 9
Comparison of the systems AGE (syntactic attributes only), AGE+ (syntactic and coherence attributes) and SAGE (syntactic, coherence and consistency attributes) on source-based datasets using quadratic weighted Kappa and exact agreement; the p-values are computed for the Kappas.

System  Metric        DS3      DS4      DS5      DS6      Average
AGE     QW Kappa      0.8096   0.8040   0.8701   0.7736   0.8143
        Exact agg.    0.7886   0.7237   0.7847   0.7314   0.7886
AGE+    QW Kappa      0.8272   0.8109   0.8729   0.7817   0.8232
        Exact agg.    0.8036   0.7375   0.7805   0.7400   0.8036
SAGE    QW Kappa      0.8340   0.8120   0.8791   0.7880   0.8283
        Exact agg.    0.8100   0.7302   0.7962   0.7353   0.8100
p-value AGE - SAGE    0.0086a  0.0349a  0.0071a  <0.001a  0.0078a
p-value AGE+ - SAGE   0.0549   0.1312   0.0174a  0.0073a  0.0750

a: p-value < 0.05

Table 10
Comparison of the proposed semantic grading system SAGE with other state-of-the-art systems. The table shows quadratic weighted Kappas achieved on different datasets. Values that differ significantly from SAGE (p < 0.05) are marked with a superscript a.

System DS1 DS2a DS2b DS3 DS4 DS5 DS6 DS7 DS8 Average

SAGE 0.93 0.79 0.67 0.83 0.81 0.89 0.79 0.88 0.81 0.83
PEG 0.82a 0.72a 0.70 0.75a 0.82 0.83a 0.81 0.84a 0.73 0.79
e-rater 0.82a 0.74a 0.69 0.72a 0.80 0.81a 0.75 0.81a 0.70a 0.77a
IntelliMetric 0.78a 0.70a 0.68 0.73a 0.79 0.83a 0.76 0.81a 0.68a 0.76a
CRASE 0.76a 0.72a 0.69 0.73a 0.76a 0.78a 0.78 0.80a 0.68a 0.75a
LightSIDE 0.79a 0.70a 0.63 0.74a 0.81 0.81a 0.76 0.77a 0.65a 0.75a
ranked-based 0.81a 0.68a 0.68 0.67a 0.73a 0.80a 0.72a 0.77a 0.71a 0.74a
AutoScore 0.78a 0.68a 0.66 0.72a 0.75a 0.82a 0.76 0.67a 0.69a 0.73a
IEA 0.79a 0.70a 0.65 0.65a 0.74a 0.80a 0.75 0.77a 0.69a 0.73a
Bookette 0.70a 0.68a 0.63 0.69a 0.76a 0.80a 0.64a 0.74a 0.60a 0.70a
Lexile 0.66a 0.62a 0.55a 0.65a 0.67a 0.64a 0.65a 0.58a 0.63a 0.63a
a: p-value < 0.05

Table 11
Accuracy comparison of various systems from the literature and results from the Kaggle competition.

System                                  Avg. acc.   Rank
SAGE                                    0.8325      1
Sollers & Gxava                         0.8014      2
SirGuessalot & PlanetThanet & Stefana   0.7986      3
VikP & jmana                            0.7978      4
Efimov+Berengueresa                     0.7956      5
@ORGANIZATIONa                          0.7947      6
PEG [21]                                0.7888      7
Martina                                 0.7857      8
cs224ua                                 0.7828      9
jackpot (Jason)a                        0.7826      10
e-rater [16]                            0.7656      11
IntelliMetric [18]                      0.7588      12
CRASE [25]                              0.7494      13
LightSIDE [32]                          0.7494      14
Ranked-based [28]                       0.7363      15
AutoScore [26]                          0.7325      16
IEA [17]                                0.7344      17
Bookette [23]                           0.6981      18
Lexile [27]                             0.6331      19

a: Results were obtained from the leader board of the AES competition on the Kaggle website (http://www.kaggle.com/c/asap-aes/data).

Results reported in [26] and [28] include Kappa values for every data set and are reported in Table 10 together with the results of our system. The evaluated systems are sorted in descending order of the average Kappa value, which is shown in the rightmost column of Table 10. Since dataset 2 has scores in two different domains, each transformed Kappa is weighted by 0.5. The Wilcoxon non-parametric test was used to compute the p-values that express the significance of the differences between each evaluated system and the SAGE system. We can see that our system achieves significantly better results on 5 out of 9 datasets (DS1, DS2a, DS3, DS5, DS7). On the remaining four datasets, the accuracy of SAGE was not statistically significantly different from the accuracy of the best performing system, while still significantly better than some of the systems. On average (the rightmost column), SAGE achieved significantly better results than 9 out of 10 other systems. We have also compared SAGE with the results obtained from the leader board of the Automated Essay Scoring competition. In Table 11 we ranked the 8 commercial systems, the 8 leading systems from the competition, LightSide [32], the ranked-based system [28] and SAGE. The results are reported in
the form of the average Kappas over all datasets, since the accuracy of the 8 leading systems on the Kaggle website is reported in that form.
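A sketch of such a comparison with SciPy is shown below, pairing the per-dataset Kappas of SAGE and e-rater from Table 10. Note that the paper's per-dataset significance markers presumably come from tests over cross-validation folds rather than over these nine averages, so this pairing is illustrative only.

```python
# Sketch: Wilcoxon signed-rank test between paired per-dataset Kappas,
# here SAGE vs. e-rater using the values reported in Table 10.
from scipy.stats import wilcoxon

sage    = [0.93, 0.79, 0.67, 0.83, 0.81, 0.89, 0.79, 0.88, 0.81]
e_rater = [0.82, 0.74, 0.69, 0.72, 0.80, 0.81, 0.75, 0.81, 0.70]

stat, p_value = wilcoxon(sage, e_rater)
print(f"Wilcoxon statistic = {stat}, p = {p_value:.4f}")
```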

6. Providing automated feedback

One of the main advantages of the system SAGE is that it provides comprehensive and informative feedback about syntactic and semantic errors. When it detects an error, it reports a pair of conflicting relations in the ontology to the student. The automated error detection is based on the logic reasoner, which we briefly evaluate in the next subsection before showing some examples.

6.1. Performance of the automated error detection system

To preliminarily evaluate the proposed automated error detection system, we constructed an artificial dataset consisting of 50 sentences describing a girl named Lisa. We manually labelled the sentences as correct or incorrect to denote the ground truth, yielding 36 correct and 14 incorrect sentences. As input to the automated error detection system we used only a common-sense ontology, and we aimed to measure how effectively the system detects incorrect sentences.

We measured the sensitivity and specificity of our system, where the sensitivity expresses the proportion of incorrect sentences that are correctly identified as such, and the specificity measures the proportion of correct sentences that are correctly identified as such. By running the experiment, we obtained 100% specificity and 71.4% sensitivity. The 100% specificity was expected, since the system treats each sentence as correct unless it detects an error in it. The sensitivity shows that there is still room for improving the detection of incorrect sentences. This will be the focus of our further work, especially improving the detection of ambiguous sentences and sentences that require reasoning, by including many different relations in the ontology.
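Both measures follow directly from the confusion counts. The sketch below reproduces the reported figures under the assumption that 10 of the 14 incorrect sentences were flagged; this breakdown is implied by the 71.4% sensitivity, but is not stated explicitly in the text.

```python
# Sketch: sensitivity and specificity as used in the evaluation above, where
# "positive" means an incorrect sentence that the detector should flag.
def sensitivity_specificity(y_true, y_pred):
    """y_true / y_pred: 1 = incorrect sentence (error expected/reported), 0 = correct."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Artificial Lisa data set: 14 incorrect and 36 correct sentences; assuming the
# detector flags 10 of the incorrect sentences and raises no false alarms.
y_true = [1] * 14 + [0] * 36
y_pred = [1] * 10 + [0] * 4 + [0] * 36
print(sensitivity_specificity(y_true, y_pred))   # -> (0.714..., 1.0)
```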
6.2. Examples of the provided feedback

Fig. 4 displays a simple example of a student who was writing about a girl named Lisa. When he wrote that Lisa is a boy, our system detected an error and reported it in the form of feedback. The system discovered the error because the classes "boy" and "girl" are subclasses of the classes "male" and "female", respectively, which are disjoint hypernyms taken into account by SAGE.

Fig. 4. The system for automatic error detection provides immediate feedback to a student by reporting that a relation in the ontology contradicts the extraction from the written sentence. This example shows successful error detection in the case of disjoint hypernyms.

Fig. 5 provides a second example, in which the student first wrote that Lisa does not like sports and later that she likes tennis. In the ontology, the word "tennis" is a subclass of "sport" and the relations "like" and "not like" are disjoint. Coreference resolution detected that "she" and "Lisa" refer to the same entity, so the system recognizes an error. Notice that the ontology and the system allow a class to participate in several relations of the same type (e.g. a like or isA relation) with different classes, as long as these classes are not disjoint (e.g. Lisa can be a girl and a student).

Fig. 5. The system reports an error with a reference to a relation in the ontology that contradicts the extraction from a written sentence. The system detected this error by considering synonyms and coreferences.

The third example (in Fig. 6) is more complex, as it combines three sentences. The first two sentences, "Lisa likes slow sports and doesn't like quick sports." and "Tennis is a quick sport.", do not initiate an error. When a student then writes the sentence "Lisa likes tennis.", the system returns an error. As mentioned before, in the ontology the word "tennis" is a subclass of "sport" and the relations "like" and "not like" are disjoint. Likewise, the classes "slow" and "quick" are disjoint hypernyms, based on which the system is able to detect the error and return the feedback.

Fig. 6. The system detects disjointness of hypernyms and reports an error with a reference to a relation in the ontology that contradicts the extraction from a written sentence.
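To make the mechanism concrete, the following is a deliberately simplified, self-contained stand-in for the ontology-based check. The real system extracts triples with an information extractor and runs an OWL logic reasoner over a full ontology; every name and rule below is illustrative and much coarser than the actual implementation.

```python
# Sketch: a toy consistency check that mirrors the Lisa examples above.
SUBCLASS = {"tennis": "sport", "boy": "male", "girl": "female"}       # isA edges
DISJOINT_CLASSES = {frozenset({"male", "female"}), frozenset({"slow", "quick"})}
DISJOINT_RELATIONS = {frozenset({"like", "not like"})}

def ancestors(cls):
    """Return cls together with all of its (transitive) superclasses."""
    out = {cls}
    while cls in SUBCLASS:
        cls = SUBCLASS[cls]
        out.add(cls)
    return out

def conflicting(fact_a, fact_b):
    """Facts are (subject, relation, object) triples extracted from sentences."""
    (s1, r1, o1), (s2, r2, o2) = fact_a, fact_b
    if s1 != s2:
        return False
    same_object = bool(ancestors(o1) & ancestors(o2))
    disjoint_rel = frozenset({r1, r2}) in DISJOINT_RELATIONS
    disjoint_obj = any(frozenset({a, b}) in DISJOINT_CLASSES
                       for a in ancestors(o1) for b in ancestors(o2))
    # "Lisa likes tennis" contradicts "Lisa does not like sports" (disjoint
    # relations on the same object), and "Lisa is a boy" contradicts
    # "Lisa is a girl" (same relation, objects under disjoint classes).
    return (disjoint_rel and same_object) or (r1 == r2 and disjoint_obj)

print(conflicting(("Lisa", "like", "tennis"), ("Lisa", "not like", "sport")))  # True
print(conflicting(("Lisa", "isA", "girl"), ("Lisa", "isA", "boy")))            # True
print(conflicting(("Lisa", "isA", "girl"), ("Lisa", "isA", "student")))        # False
```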
7. Discussion and conclusion

The proposed automated essay evaluation system SAGE incorporates additional semantic attributes to improve the prediction accuracy and to provide meaningful feedback. The proposed novel attributes measure the coherence and consistency of a text. The coherence attributes are based on vocabulary changes through the text, and the consistency attributes are derived from the output of an ontology-based logic reasoner. The reader should note that the semantic feedback our system provides about an essay is obtained automatically.

By comparing SAGE to 10 other state-of-the-art systems, we achieved better results on 8 out of 9 different data sets. The only system that outperforms it is PEG [71], whose technological details are not public. However, a noticeable benefit of SAGE is that it provides automated detection of semantic errors and feedback. Such semantic feedback offers a meaningful explanation by suggesting improvements, which can lead to faster improvement of students' writing skills.

An interesting finding that confirms the results of other researchers [56] is the influence of the number of words and characters on the final grade. We agree that there is a strong correlation between the length of the essay and the final score, but we argue that the purpose of AEE systems is not only to score an essay but also to provide meaningful, constructive and holistic feedback for each essay.

The open challenges for our future work include further development of different semantic attributes and improvement of the feedback. We shall also test approaches other than TF-IDF for transforming text into the attribute space, to discover how the alternatives impact the results. We plan to use other external sources to determine whether statements in an essay are true and consistent. One of the future challenges is to develop and incorporate new approaches for unsupervised taxonomy learning, since our current approach uses only WordNet as the underlying taxonomy. A second goal is to incorporate inference rules that will also help detect implicit errors and facts/relations that are not explicitly written in an essay.

Development of the AEE field is of great importance to teachers and students. It can not only help reduce teachers' load, but can also help students become more autonomous during their learning process.

Systems with feedback can be an aid, not a replacement, for classroom instruction, and they can help students achieve progress faster. Students can use SAGE in the classroom as well as at home, while learning. The feedback for each specific response returned by our system provides information on the quality of different aspects of writing, a score, and descriptive feedback. The system's constant availability for scoring gives students the possibility to repetitively practice their writing at any time. SAGE is consistent, as it predicts the same score for a single essay each time that essay is input to the system. This is important, since scoring consistency between prompts turned out to be one of the most difficult psychometric issues in human scoring [13].

Advantages of automated feedback are its anonymity, its instantaneousness, and its encouragement of repetitive improvements by giving students more practice in writing essays [72]. By publicly providing the technical details and results of our AEE system, we also aim to promote the openness of this research field. Hopefully, this shall open opportunities for progress and help bring more AEE systems into practical applications.

References

[1] M.D. Shermis, J. Burstein, Introduction, in: M.D. Shermis, J. Burstein (Eds.), Automated Essay Scoring: A Cross-Disciplinary Perspective, Lawrence Erlbaum Associates, Mahwah, NJ, 2003, pp. xiii–xvi.
[2] E.B. Page, The imminence of... grading essays by computer, Phi Delta Kappan 47 (5) (1966) 238–243.
[3] T.K. Landauer, P.W. Foltz, D. Laham, An introduction to latent semantic analysis, Discourse Process. 25 (2–3) (1998) 259–284.
[4] T. Kakkonen, N. Myller, E. Sutinen, J. Timonen, Comparison of dimension reduction methods for automated essay grading, Educ. Technol. & Soc. 11 (3) (2008) 275–288.
[5] Y. Attali, A Differential Word Use Measure for Content Analysis in Automated Essay Scoring, ETS Research Report Series, 36, 2011.
[6] P.W. Foltz, W. Kintsch, T.K. Landauer, The measurement of textual coherence with latent semantic analysis, Discourse Process. 25 (2–3) (1998) 285–307.
[7] P.W. Foltz, Discourse coherence and LSA, in: T.K. Landauer, D.S. McNamara, S. Dennis, W. Kintsch (Eds.), Handbook of Latent Semantic Analysis, Lawrence Erlbaum Associates, Inc., Mahwah, New Jersey, 2007, pp. 167–184.
[8] D. Higgins, J. Burstein, D. Marcu, C. Gentile, Evaluating multiple aspects of coherence in student essays, in: Proceedings of HLT-NAACL, Boston, MA, 2004.
[9] J. Burstein, J. Tetreault, S. Andreyev, Using entity-based features to model coherence in student essays, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Association for Computational Linguistics, Los Angeles, California, 2010, pp. 681–684.
[10] F. Gutierrez, D. Dou, S. Fickas, G. Griffiths, Online reasoning for ontology-based error detection in text, in: On the Move to Meaningful Internet Systems: OTM 2014 Conferences, Lecture Notes in Computer Science, 8841, 2014, pp. 562–579.
[11] E. Brent, C. Atkisson, N. Green, Time-shifted collaboration: creating teachable moments through automated grading, in: A. Juan, T. Daradournis, S. Caballe (Eds.), Monitoring and Assessment in Online Collaborative Environments: Emergent Computational Technologies for E-learning Support, IGI Global, 2010, pp. 55–73.
[12] I.I. Bejar, A validity-based approach to quality control and assurance of automated scoring, Assess. Educ. 18 (3) (2011) 319–341.
[13] Y. Attali, Validity and reliability of automated essay scoring, in: M.D. Shermis, J.C. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 181–198.
[14] D.M. Williamson, X. Xi, F.J. Breyer, A framework for evaluation and use of automated scoring, Educ. Meas. 31 (1) (2012) 2–13.
[15] M.D. Shermis, J. Burstein, S.A. Bursky, Introduction to automated essay evaluation, in: M.D. Shermis, J. Burstein, S.A. Bursky (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 1–15.
[16] J. Burstein, J. Tetreault, N. Madnani, The E-rater automated essay scoring system, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 55–67.
[17] P.W. Foltz, L.A. Streeter, K.E. Lochbaum, T.K. Landauer, Implementation and applications of the Intelligent Essay Assessor, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 68–88.
[18] M.T. Schultz, The IntelliMetric automated essay scoring engine - a review and an application to Chinese essay scoring, in: M.D. Shermis, J.C. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 89–98.
[19] E. Mayfield, C. Penstein-Rosé, An interactive tool for supporting error analysis for text mining, in: Proceedings of the NAACL HLT 2010 Demonstration Session, Los Angeles, CA, 2010, pp. 25–28.
[20] Y. Chali, S.A. Hasan, On the effectiveness of using syntactic and shallow semantic tree kernels for automatic assessment of essays, in: Proceedings of the International Joint Conference on Natural Language Processing, Nagoya, Japan, 2013, pp. 767–773.
[21] E.B. Page, Computer grading of student prose, using modern concepts and software, J. Exp. Educ. 62 (2) (1994) 127–142.
[22] O. Mason, I. Grove-Stephenson, Automated free text marking with paperless school, in: Proceedings of the Sixth International Computer Assisted Assessment Conference, 2002, pp. 213–219.
[23] C.S. Rich, M.C. Schneider, J.M. D'Brot, Applications of automated essay evaluation in West Virginia, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 99–123.
[24] A. Fazal, T. Dillon, E. Chang, Noise reduction in essay datasets for automated essay grading, Lect. Notes Comput. Sci. 7046 (2011) 484–493.
[25] S.M. Lottridge, E.M. Schulz, H.C. Mitzel, Using automated scoring to monitor reader performance and detect reader drift in essay scoring, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 233–250.
[26] M.D. Shermis, B. Hamner, Contrasting state-of-the-art automated scoring of essays: analysis, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 313–346.
[27] M.I. Smith, A. Schiano, E. Lattanzio, Beyond the classroom, Knowl. Quest 42 (3) (2014) 20–29.
[28] H. Chen, B. He, T. Luo, B. Li, A ranked-based learning approach to automated essay scoring, in: Proceedings of the Second International Conference on Cloud and Green Computing, IEEE, 2012, pp. 448–455.
[29] L. Bin, Y. Jian-Min, Automated essay scoring using multi-classifier fusion, Commun. Comput. Inf. Sci. 233 (2011) 151–157.
[30] L.M. Rudner, T. Liang, Automated essay scoring using Bayes' theorem, J. Technol. Learn. Assess. 1 (2) (2002) 3–21.
[31] J.R. Christie, Automated essay marking - for both style and content, in: Proceedings of the Third Annual Computer Assisted Assessment Conference, 1999.
[32] E. Mayfield, C. Rosé, LightSIDE: open source machine learning for text, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 124–135.
[33] M.M. Islam, A.S.M.L. Hoque, Automated essay scoring using generalized latent semantic analysis, J. Comput. 7 (3) (2012) 616–626.
[34] R. Williams, H. Dreher, Automatically grading essays with Markit, Issues Inf. Sci. Inf. Technol. 1 (2004) 693–700.
[35] M.A. Hearst, TextTiling: segmenting text into multi-paragraph subtopic passages, Comput. Ling. 23 (1) (1997) 33–64.
[36] E. Miltsakaki, K. Kukich, Automated evaluation of coherence in student essays, in: Proceedings of LREC-2000, Linguistic Resources in Education Conference, Athens, Greece, 2000, pp. 140–147.
[37] B.J. Grosz, A.K. Joshi, S. Weinstein, Centering: a framework for modelling the local coherence of discourse, Comput. Ling. 21 (2) (1995) 203–226.
[38] J.C. Burstein, J.R. Tetreault, M. Chodorow, D. Blanchard, S. Andreyev, Automated evaluation of discourse coherence quality in essay writing, in: M.D. Shermis, J.C. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 267–280.
[39] R. Barzilay, M. Lapata, Modeling local coherence: an entity-based approach, Comput. Ling. 34 (1) (2008) 1–34.
[40] F. Gutierrez, D.C. Wimalasuriya, D. Dou, Using information extractors with the neural electromagnetic ontologies, in: Proceedings of the 2011 Confederated International Conference on On the Move to Meaningful Internet Systems (OTM'11), 2011, pp. 31–32.
[41] F. Gutierrez, D. Dou, S. Fickas, G. Griffiths, Providing grades and feedback for student summaries by ontology-based information extraction, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management - CIKM '12, 2012, pp. 1722–1726.
[42] F. Gutierrez, D. Dou, A. Martini, S. Fickas, H. Zong, Hybrid ontology-based information extraction for automated text grading, in: Proceedings of the 12th International Conference on Machine Learning and Applications, 2013, pp. 359–364.
[43] D.C. Wimalasuriya, D. Dou, Ontology-based information extraction: an introduction and a survey of current approaches, J. Inf. Sci. 36 (3) (2010) 306–323.
[44] F. Wu, D.S. Weld, Open information extraction using Wikipedia, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 118–127.
[45] P. Gamallo, An overview of open information extraction, in: M.J.A.V. Pereira, J.P. Lea, A. Simões (Eds.), Invited Talk at the 3rd Symposium on Languages, Applications and Technologies, SLATE'14, 2014, pp. 13–16.
[46] https://www.google.com/patents/US20140032209.
[47] L.D. Corro, R. Gemulla, ClausIE: clause-based open information extraction, in: Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, 2013, pp. 355–366.
[48] H. Bast, E. Haussmann, Open information extraction via contextual sentence decomposition, in: Proceedings of the 7th International Conference on Semantic Computing (ICSC), Irvine, California, 2013, pp. 154–159.
[49] P. Gamallo, M. Garcia, S. Fernandez-Lanza, Dependency-based open information extraction, in: Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, Avignon, France, 2012, pp. 10–18.
[50] V. Punyakanok, D. Roth, The use of classifiers in sequential inference, in: Neural Information Processing Systems 2001, Vancouver, British Columbia, 2001, pp. 995–1001.
[51] E. Bengtson, D. Roth, Understanding the value of features for coreference resolution, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP '08, Waikiki, Honolulu, Hawaii, 2008, pp. 294–303.
[52] D. Chen, C.D. Manning, A fast and accurate dependency parser using neural networks, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 740–750.
[53] T. Gruber, Ontology, in: L. Liu, M.T. Ozsu (Eds.), Encyclopedia of Database Systems, Springer-Verlag, 2009.
[54] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D.L. McGuinness, P.F. Patel-Schneider, L.A. Stein, OWL Web Ontology Language, Technical Report, 2004.
[55] B. Motik, R. Shearer, I. Horrocks, Hypertableau reasoning for description logics, J. Artif. Intell. Res. 36 (2009) 165–228.
[56] L. Perelman, When "the state of the art" is counting words, Assess. Writing 21 (2014) 104–111.
[57] W.H. Dubay, Smart Language: Readers, Readability, and the Grading of Text, BookSurge Publishing, 2007.
[58] C. Smith, A. Jönsson, Automatic summarization as means of simplifying texts, an evaluation for Swedish, in: B. Sandford Pedersen, G. Nešpore, I. Skadina (Eds.), Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011, 2011, pp. 198–205.
[59] A. Mellor, Essay length, lexical diversity and automatic essay scoring, Mem. Osaka Inst. Technol. 55 (2) (2011) 1–14.
[60] P.J. Clark, F.C. Evans, Distance to nearest neighbor as a measure of spatial relationships in populations, Ecology 35 (4) (1954) 445–453.
[61] P.A.P. Moran, Notes on continuous stochastic phenomena, Biometrika 37 (1–2) (1950) 17–23.
[62] R.C. Geary, The contiguity ratio and statistical mapping, Incorporated Stat. 5 (3) (1954) 115–145.
[63] A. Getis, J.K. Ord, The analysis of spatial association by use of distance statistics, Geogr. Anal. 24 (3) (1992) 189–206.
[64] P. Cassidy, Toward an open-source foundation ontology representing the Longman's defining vocabulary: the COSMO ontology OWL version, in: Proceedings of the Third International Ontology for the Intelligence Community Conference, Fairfax, VA, 2009.
[65] G.A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[66] H. Peng, K.-W. Chang, D. Roth, A joint framework for coreference resolution and mention head detection, in: Proceedings of the 19th Conference on Computational Natural Language Learning (CoNLL'15), Association for Computational Linguistics, Beijing, China, 2015, pp. 12–21.
[67] C.D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S.J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of the 52nd Annual Meeting of the ACL: System Demonstrations, Association for Computational Linguistics, Baltimore, Maryland, 2014, pp. 55–60.
[68] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O'Reilly Media, 2009.
[69] G.K. Kanji, 100 Statistical Tests, 3rd ed., SAGE Publications, London, Thousand Oaks, New Delhi, 2006.
[70] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Mach. Learn. 63 (1) (2006) 3–42.
[71] M.D. Shermis, H.R. Mzumara, J. Olson, S. Harrington, On-line grading of student essays: PEG goes on the world wide web, Assess. Eval. Higher Educ. 26 (3) (2001) 247–259.
[72] S.C. Weigle, English as a second language writing and automated essay evaluation, in: M.D. Shermis, J.C. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, 2013, pp. 36–54.
