Automated Essay Evaluation With Semantic Analysis
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Article history: Received 24 March 2016; Revised 12 October 2016; Accepted 1 January 2017; Available online 3 January 2017

Keywords: Automated scoring; Essay evaluation; Natural language processing; Semantic attributes; Semantic feedback

Abstract

Essays are considered the most useful tool to assess learning outcomes, guide students' learning process, and measure their progress. Manual grading of students' essays is a time-consuming process, but is nevertheless necessary. Automated essay evaluation represents a practical solution to this task; however, its main weakness is the predominant focus on vocabulary and text syntax, and the limited consideration of text semantics. In this work, we propose an extension of existing automated essay evaluation systems by incorporating additional semantic coherence and consistency attributes. We design the novel coherence attributes by transforming sequential parts of an essay into the semantic space and measuring changes between them to estimate the coherence of the text. The novel consistency attributes detect semantic errors using information extraction and logic reasoning. The resulting system (named SAGE - Semantic Automated Grader for Essays) provides semantic feedback for the writer and achieves significantly higher grading accuracy compared with 9 other state-of-the-art automated essay evaluation systems.

© 2017 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2017.01.006
the statements in the essays. Despite the efforts, the latter systems are not automatic, as they require manual interventions from the user.

In this paper, we propose an extension to existing state-of-the-art AEE systems that incorporates additional novel attributes for measuring coherence (semantic development) and consistency of facts (compared to common sense knowledge and other facts in essays). The proposed coherence attributes measure distance, spatial patterns, and spatial autocorrelation between parts of the essay. The consistency attributes measure the number of semantic errors in a student essay using information extraction and logical reasoning. The most significant contribution of the proposed system is that by detecting semantic errors, the system is also able to provide semantic feedback about the essay. As discussed in many papers [12–14], the goal of AEE systems is no longer to accurately reproduce the human graders' scores, which are inconsistent in their grading, but to provide valid scores and consequently also immediate, accurate, and informative feedback. This feedback is important so that the students can achieve progress and improve their writing. We compare our system (called SAGE - Semantic Automated Grader for Essays) with existing state-of-the-art systems and show that SAGE achieves significantly higher grading accuracy compared with 9 other state-of-the-art automated essay evaluation systems.

The paper is divided into seven sections. Section 2 describes several subfields of the related work that are relevant to our research. Section 3 describes the augmented essay evaluation system including the novel attributes, Section 4 presents the implementation and evaluation of the proposed system, and Section 5 the results of our system SAGE. Section 6 describes the semantic feedback and Section 7 draws conclusions.

2. Related work

The development of the AEE field has been carried out in different problem areas that we highlight in the following subsections.

2.1. Automated essay evaluation systems

The first AEE system was proposed almost 50 years ago. In 1966, the high school English teacher E. Page [2] proposed the Project Essay Grade – the first automated system for grading student essays. He saw the system as a solution to reducing hours of manually grading student essays. Despite its impressive success at predicting teachers' essay ratings, the early version of the system received only limited acceptance in the writing and education community. The availability of necessary tools (home computers, the Internet, computational techniques for automatically extracting measures of writing quality, etc.) was poor and society criticised the idea of displacing human graders [15]. The later widespread use of the Internet, word processing software and NLP accelerated the development of AEE systems.

In the past, one of the main obstacles to achieving progress in this area was the lack of open-source AEE systems, which would allow insight into their grading methodology. Four commercial systems achieved predominance in this field: Project Essay Grade (PEG) [2], E-rater [16], Intelligent Essay Assessor (IEA) [17], and IntelliMetric [18]. In 2010, Mayfield and Rosé released LightSIDE [19], an automated evaluation engine with both compiled and source code publicly available. This program is designed as a tool for non-experts for a variety of purposes, including essay assessment. In order to compare technical, methodological, and structural characteristics of the state-of-the-art systems, we present their comparison in Table 1 using the following main criteria:

• Type of attributes: The attributes describing the quality of an essay can roughly be divided into three groups: style, content and semantic attributes. Style attributes focus on lexical sophistication, grammar and mechanics (spelling, capitalization, and punctuation). Content attributes vaguely describe the semantics of an essay and are based on comparing an essay with the source text and other already graded essays. Semantic attributes are based on verifying the correctness of content meaning.
• Methodology: Different systems use various approaches to extract attributes from essays. The most widely used methodology is based on NLP. Systems focusing on content mostly use Latent Semantic Analysis (LSA) - a machine learning method that analyses related concepts between a set of documents and the contained terms. LSA assumes that words with similar meaning occur in similar parts of text. To evaluate content, systems use pattern matching techniques (PMT) and extensions to LSA such as Generalized Latent Semantic Analysis (GLSA) (which uses an n-gram-by-document matrix instead of a word-by-document matrix) and an improvement that considers semantics by means of syntactic and shallow semantic tree kernels [20]. For verifying the correctness and consistency of content, approaches such as (Open) Information Extraction ((O)IE), Semantic Networks (SN), Ontologies, Fuzzy Logic (FL), and Description Logic (DL) are used.
• Prediction model: The majority of the systems use machine learning algorithms (usually regression modeling) to predict the final grade. An alternative is to use: the Lexile measure - an estimate of a student's ability to express language in writing based on semantic complexity (level of expressing) and syntactic sophistication (how the words are combined into sentences); cosine similarity; and rule-based expert systems.

Table 1 provides a comparison of characteristics for the majority of AEE systems and approaches, including proprietary (non-public) systems, two publicly available systems, and approaches proposed by the academic community.

Table 1. A comparison of the key features of the state-of-the-art AEE systems.

The majority of the systems in Table 1 were evaluated on a substantially large set of prompt-specific essays that were pre-scored by human expert graders. These datasets were divided into training and test sets. The training set was used to develop the scoring model of the AEE system. This scoring model was then used to assign scores to essays in the test set. The performance of the scoring model was validated by calculating how well the scoring model "replicated" the scores assigned by the human expert graders [24].

2.2. Measuring coherence of the text

Coherence is a concept that describes the flow of information from one part of discourse to another and ranges from lower-level cohesive elements such as coreference, causal relationships, and connectives, up to higher-level elements that evaluate connections between the discourse and the reader's mental representation of it [7]. Existing systems measure coherence in noisy text with different supervised and unsupervised approaches. The unsupervised approaches usually measure lexical cohesion, i.e. repetition of words and phrases in an essay. Foltz et al. [6,7] assume that coherent texts contain a high number of semantically related words and measure coherence as a function of semantic relatedness between adjacent sentences. Relatedness can be computed using LSA without employing syntactic or other annotations. Hearst [35] subdivides texts into multi-paragraph units that represent subtopics and identifies patterns of lexical co-occurrence and distribution, i.e. identifying repetition of vocabulary across adjacent sentences.
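The Foltz et al. approach just described can be made concrete with a short sketch: project each sentence into an LSA space and average the cosine similarity of adjacent sentence vectors. The paper does not specify an LSA implementation, so TfidfVectorizer and TruncatedSVD from scikit-learn are used here as stand-ins; this is an illustrative sketch, not the authors' code.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_coherence(sentences, k=2):
    """Mean cosine similarity between LSA vectors of adjacent sentences."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # sentence-by-term matrix
    k = min(k, tfidf.shape[1] - 1)                      # LSA dimensionality
    vectors = TruncatedSVD(n_components=k).fit_transform(tfidf)
    sims = []
    for a, b in zip(vectors, vectors[1:]):              # adjacent sentence pairs
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b / denom) if denom else 0.0)
    return sum(sims) / len(sims)

essay = [
    "The cell is the basic unit of life.",
    "Each cell is surrounded by a membrane.",
    "The membrane controls what enters the cell.",
    "My favourite food is pizza.",
]
print(lsa_coherence(essay))  # the off-topic final sentence lowers the score

Coherent discourse produces small movements between consecutive vectors, so the average similarity drops when an off-topic sentence appears, exactly the intuition exploited by the unsupervised measures above.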
The supervised learning approaches require annotated data (graded essays). They focus on occurrences of discourse elements (e.g. thesis statement, main idea, conclusion), entity sentence roles, grammar errors, and word usage. Miltsakaki and Kukich [36] have explored the role of centering theory [37] in locating topic shifts in student essays. Centering theory argues that the discourse in a text contains a set of textual segments, each containing discourse entities, which are then ranked by their importance. Topic shifts are generated by short-lived topics and are indicative of poor topic development. Higgins et al. [8] have developed a system that computes similarity across text segments based on their type of discourse element and semantic similarity (LSA). A support vector machine (SVM) uses these features to capture breakdowns in coherence due to relatedness to the essay question and relatedness between discourse elements. More recently, Burstein et al. [9,38] showed how the Barzilay and Lapata [39] algorithm can be applied to the domain of student essays. In Barzilay and Lapata's [39] approach, entities (nouns and pronouns) are represented by their sentence roles and the algorithm counts all possible entity transitions between adjacent sentences in the text. By combining those entity-based features with features related to grammar errors and word usage, Burstein and colleagues [38] improve the performance of automated coherence prediction for student essays.

2.3. Detecting semantic errors in an essay

Only two of the mentioned systems [10,11] partially check whether the statements in the essays are correct. SAGrader, developed by Brent [11], was the first AEE system that detected semantic information in an essay, and it is the system upon which we based the architecture of our own. For SAGrader, the teacher first specifies the assignment prompt and desired features along with relationships among them. Using fuzzy logic, the system recognizes word combinations that can be used by students to detect desired features and relationships. Desired knowledge in the form of a semantic network is then compared with the knowledge detected in a student's essay. The system scores the student's essay based on the similarities between observed and desired knowledge using procedural rules. Detailed feedback indicates what the student did right and wrong [11].

Gutierrez et al. [10,40–42] later proposed a system that not only detects the desired (correct) knowledge but also detects incorrect knowledge using logic reasoning [10]. The system extracts statements using Open Information Extraction (OIE) and adds them to the domain ontology. The extracted tuples are in a form that is compatible with the OWL ontology. In the final step, the system determines the correctness of a statement through ontology-based consistency checking. If the domain ontology becomes inconsistent after the extracted sentence is added into it, then the sentence is incorrect with respect to the domain [10]. Despite many efforts, this system is still not fully automatic, as it requires manual inputs from the user.

2.4. Open information extraction

Information extraction is the task of automatically acquiring knowledge by transforming natural language text into structured information, such as a knowledge base [43]. The main tasks of information extraction are entity recognition, relation extraction, and coreference resolution. We focused on a tool for relation extraction called Open Information Extraction (OIE). Wu and Weld [44] define an OIE system as a function that maps an unstructured document text d to a set of triples, {<arg1, rel, arg2>}, where the args are noun phrases and rel is a textual fragment indicating an implicit semantic relation between the two noun phrases. Unlike other relation extraction methods focused on a predefined set of target relations, the open information extraction paradigm is not limited to a small set of target relations known in advance, but extracts new types of relations found in the text. The main properties of OIE systems are domain independence, reliance on unsupervised extraction methods, and scalability to large amounts of text [45].

Gamallo [45] categorized the existing OIE systems into four groups. He first divided them into two broad categories: systems that require training data to learn a classifier and systems based on hand-crafted rules or heuristics. In addition, each category can be divided into two further types: systems that use shallow syntactic analysis (e.g. part-of-speech tagging and/or chunking) and systems that rely on deeper, dependency-based analysis [49].
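The Wu and Weld contract can be illustrated with a toy sketch: a document maps to a set of <arg1, rel, arg2> triples. The regex "extractor" below is only a stand-in invented for this example; real OIE systems such as ClausIE [47] rely on genuine syntactic analysis.

import re
from typing import NamedTuple

class Triple(NamedTuple):
    arg1: str   # noun phrase
    rel: str    # textual fragment naming the relation
    arg2: str   # noun phrase

def toy_extract(text: str) -> set:
    """Map a document d to a set of <arg1, rel, arg2> triples (toy version)."""
    triples = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        m = re.match(r"(\w+(?: \w+)?) (is a|lives in|wrote) (\w+(?: \w+)?)[.!?]?$",
                     sentence)
        if m:
            triples.add(Triple(*m.groups()))
    return triples

print(toy_extract("Lisa is a student. Lisa lives in Ljubljana."))
# {Triple('Lisa', 'is a', 'student'), Triple('Lisa', 'lives in', 'Ljubljana')}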
Table 2. Linguistic (lexical sophistication, grammar, mechanics) and content attributes.

Lexical sophistication:
1. Number of characters, 2. Number of words, 3. Number of long words, 4. Number of short words, 5. Most frequent word length, 6. Average word length, 7. Number of sentences, 8. Number of long sentences, 9. Number of short sentences, 10. Most frequent sentence length, 11. Average sentence length, 12. Number of different words, 13. Number of stopwords.

Readability measures [57,58]:
14. Gunning Fog index, 15. Flesch reading ease, 16. Flesch-Kincaid grade level, 17. Dale-Chall readability formula, 18. Automated readability index, 19. Simple Measure of Gobbledygook, 20. LIX, 21. Word variation index, 22. Nominal ratio.

Lexical diversity [59]:
23. Type-token ratio, 24. Guiraud's index, 25. Yule's K, 26. The D estimate, 27. Hapax legomena (number of words occurring only once in a text), 28. Advanced Guiraud.

Grammar:
29. Number of different PoS tags, 30. Height of the tree presenting sentence structure, 31. Correct verb form, 32. Number of grammar errors.

Number of each PoS tag:
33. Coordinating conjunction, 34. Numeral, 35. Determiner, 36. Existential "there", 37. Preposition/subordinating conjunction, 38. Adjective, 39. Comparative adjective, 40. Superlative adjective, 41. Ordinal adjective or numeral, 42. Modal auxiliary, 43. Singular or mass common noun, 44. Plural common noun, 45. Singular proper noun, 46. Plural proper noun, 47. Preposition, 48. Participle, 49. Predeterminer, 50. Genitive marker, 51. Personal pronoun, 52. Possessive pronoun, 53. Adverb, 54. Comparative adverb, 55. Superlative adverb, 56. Particle ("to" as preposition or infinitive marker), 57. Verb - base form, 58. Verb - past tense, 59. Verb - gerund/present participle, 60. Verb - past participle, 61. Verb - 3rd person singular present, 62. Wh-determiner, 63. Wh-pronoun, 64. Wh-adverb.

Mechanics:
65. Number of spellchecking errors, 66. Number of capitalization errors, 67. Number of punctuation errors.

Content:
68. Cosine similarity with source text, 69. Score point level for maximum cosine similarity over all score points, 70. Cosine similarity with essays that have the highest score point level, 71. Pattern cosine [5], 72. Weighted sum of all cosine correlation values [5].
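Most of the Table 2 attributes are simple counts and can be computed with standard NLP tooling; the paper reports using NLTK (Section 4). The sketch below computes a handful of them (the attribute numbers in the comments refer to Table 2) and is an illustration under that assumption, not the authors' implementation. It requires the 'punkt', 'stopwords' and PoS-tagger NLTK data packages.

from collections import Counter
import nltk
from nltk.corpus import stopwords

def linguistic_attributes(essay: str) -> dict:
    sentences = nltk.sent_tokenize(essay)
    words = [w for w in nltk.word_tokenize(essay) if w.isalpha()]
    tags = Counter(tag for _, tag in nltk.pos_tag(words))
    stop = set(stopwords.words("english"))
    return {
        "n_characters": sum(len(w) for w in words),            # attr 1
        "n_words": len(words),                                 # attr 2
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),  # attr 6
        "n_sentences": len(sentences),                         # attr 7
        "n_different_words": len({w.lower() for w in words}),  # attr 12
        "n_stopwords": sum(w.lower() in stop for w in words),  # attr 13
        "n_pos_tags": len(tags),                               # attr 29
        **{f"pos_{t}": c for t, c in tags.items()},            # attrs 33-64
    }

print(linguistic_attributes("The cell is alive. It divides quickly."))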
In Fig. 2 we use the same example to illustrate the definition of our semantic coherence attributes, which we explain in the following subsections.

3.2.1. Basic coherence measures
Basic coherence measures measure the distance between parts of the essay, which are represented as points in the semantic space. We use two variants of each attribute in this group: one computed using the Euclidean distance metric and the other computed using the cosine similarity (in the following we will use the term "distance" to interchangeably denote any of the two). The proposed attributes are:

- average distance between neighboring points in semantic space (denoted by thin grey lines in Fig. 2a). Foltz [7] has already shown, by measuring cosine similarity between sentences in an essay, that highly coherent discourses have small movements in semantic space and vice versa. We defined similar attributes that describe the average distance between these points;
- minimum and maximum distance between neighboring points and their quotient;
- average distance between any two points, which measures how well an idea persists within the essay;
- maximum difference between any two points, which measures the diameter of the area that is covered with points and thus the breadth of the discussed concept in the space (illustrated in Fig. 2a);
- Clark and Evans' [60] distance to the nearest neighbor of each point in the semantic space for measuring spatial relationships:

R = \frac{\sum_{i=1}^{N} r_i / N}{1/(2\sqrt{N})} = \frac{2\sqrt{N} \sum_{i=1}^{N} r_i}{N} \qquad (2)

where r_i is the distance from a given point to its nearest neighbor (see Fig. 2b) and N is the number of points. It is the measure of the degree to which the observed distribution differs from random expectation with respect to the nearest neighbor [60];
- average distance to the nearest neighbor, which measures how fast an idea develops across an essay (see Fig. 2b);
- cumulative frequency distribution G of the nearest neighbors' distances;
- standard distance:

S_D = \sqrt{\frac{\sum_{k=1}^{n} \sum_{i=1}^{N} (D_{ki} - D_{kc})^2}{N}} \qquad (4)

where D_{ki}, k = 1, ..., n, i = 1, ..., N, is the kth coordinate component of point i, D_{kc} is the kth coordinate component of the mean center, n is the number of dimensions, and N is the number of points. Similar to the standard deviation, the standard distance is also strongly influenced by extreme values. Because distances to the mean center are squared, the atypical points have a dominant impact on the magnitude of this metric, which allows detecting deviating (incoherent) essay parts;
- relative distance, a descriptive measure of relative spatial dispersion that normalizes the standard distance by d_max, the maximum distance of any point from the centroid. This enables direct comparison of the dispersion of different point patterns from different areas, even if the areas are of varying sizes;
- the determinant of the distance matrix, a measure of spatial dispersion. This allows us to measure the dispersity of the content and consequently how broad the discussed topic is.

3.2.3. Spatial autocorrelation
Measures of spatial autocorrelation express how data tends to be clustered together in space (positive spatial autocorrelation) or dispersed (negative spatial autocorrelation). They enable us to detect global and local semantic coherence of the essays' content. If the essay exhibits positive spatial autocorrelation, this indicates that it is well structured and that the parts of the essay are well related to each other.

Typical measures of spatial autocorrelation are Moran's I [61], Geary's C [62], and Getis's G [63]. We adjusted these three measures so we can use them in our high-dimensional semantic space. The adjusted Geary's C is:

C = \frac{N-1}{2} \cdot \frac{\sum_{k=1}^{n} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (D_{ki} - D_{kj})^2}{\sum_{k=1}^{n} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (D_{ki} - D_{kc})^2} \qquad (7)

where D_{ki}, k = 1, ..., n, i = 1, ..., N, is the kth coordinate component of point i, D_{kc} is the kth coordinate component of the mean center, n is the number of dimensions, N is the number of points, and w_{ij} are point weights as described previously. Getis's G enables us to examine point patterns at a more local scale: it measures the overall concentration or lack of concentration of all pairs of values (D_i, D_j) such that i and j are within distance d of each other. We adjusted the measure analogously to use it in the high-dimensional space.
Table 3. List of novel semantic attributes. The attributes denoted with "(2x)" were computed twice, once using the Euclidean distance and once using the cosine similarity.

Basic coherence measures:
1–2. Average distance between neighboring points (2x)
3–4. Minimum distance between neighboring points (2x)
5–6. Maximum distance between neighboring points (2x)
7–8. Index (minimum distance/maximum distance) (2x)
9–10. Average distance between any two points (2x)
11–12. Maximum difference between any two points (2x)
13. Clark's and Evans' distance to nearest neighbor
14. Average distance to nearest neighbor
15. Cumulative frequency distribution
16–17. Average distance between points and centroid (2x)
18–19. Minimum distance between points and centroid (2x)
20–21. Maximum distance between points and centroid (2x)
22–23. Index (minimum distance/maximum distance) (2x)
24. Standard distance
25. Relative distance
26. Determinant of distance matrix

Spatial autocorrelation:
27. Moran's I
28. Geary's C
29. Getis's G
2. Domain knowledge: In addition to source text knowledge, the system supplements the base ontology with domain knowledge that contains knowledge about the wider scope of an essay in the form of an ontology, including synonyms and hypernyms. For example, if students write an essay about genes and biology, we add the Gene Ontology.6
3. Target knowledge: The source text and the domain knowledge represent knowledge about a specific domain. Professors or assessors can add specific desired knowledge which they explicitly expect the students to express in an essay. The presence of the target knowledge in an essay can have an important role when grading an essay. Detection of this knowledge can increase the accuracy of grading and improve the feedback quality.

3.3.2. Processing of the ungraded essay
Preprocessing. In the preprocessing phase the system first reads an essay and breaks it into sentences. Then it creates a duplicate of each sentence and performs several preprocessing steps: tokenization; part-of-speech tagging; finding and labeling stopwords, punctuation marks, determiners and prepositions; transformation to lower-case; and stemming.
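The preprocessing steps listed above map directly onto NLTK primitives, which the authors report using (Section 4). The sketch below is our reading of that pipeline; the exact field names and label choices are ours.

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()

def preprocess(essay: str):
    for sentence in nltk.sent_tokenize(essay):   # break the essay into sentences
        tokens = nltk.word_tokenize(sentence)    # tokenization
        tagged = nltk.pos_tag(tokens)            # part-of-speech tagging
        processed = [{
            "token": token,
            "pos": tag,
            "is_stopword": token.lower() in STOP,
            "is_punctuation": token in string.punctuation,
            "is_determiner": tag == "DT",
            "is_preposition": tag == "IN",
            "lower": token.lower(),              # transformation to lower-case
            "stem": STEM.stem(token.lower()),    # stemming
        } for token, tag in tagged]
        yield sentence, processed

for sent, toks in preprocess("The genes are located on chromosomes."):
    print(sent, [t["stem"] for t in toks])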
Entity recognition. Shallow parsing is the process of identifying syntactical phrases in natural language sentences. A shallow parser identifies several kinds of phrases (chunks) that are derived from parse trees, i.e. noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adverb phrase (ADVP), and clause introduced by a subordinating conjunction (SBAR). These chunks provide an intermediate step to natural language understanding. Although identifying whole parse trees can provide deeper analyses of the sentences, it is a much harder problem [50].

Our system uses the Illinois Shallow Parser [50] to determine chunks, which we later use for coreference resolution, for searching for a suitable chunk when connecting extractions with parts of sentences, and for matching chunks with individuals, classes and relations within the ontology.

Coreference resolution. A given entity - representing a person, a location, or an organization - can be mentioned in text in multiple, ambiguous ways. Understanding natural language and supporting intelligent access to textual information requires identifying whether different entity mentions are actually referencing the same entity [51]. The coreference resolution processes unannotated essay text and shows which mentions are coreferential.

Our system uses two different coreference resolution systems (the Illinois Coreference Resolution [66] and the Stanford Parser [67]) to detect coreferences in an essay and uses them when adding extractions to the ontology. The system combines coreferences discovered by both systems and thus increases the accuracy of discovered coreferences.
Open information extraction. After the above phases, the system performs information extraction using four systems and returns triples, as we have described in Section 2.4. Within the process, the duplicate extractions are removed, as well as the faulty extractions (e.g. those consisting only of a subject and a relation, while an object is missing).

After all of the previously described phases, the system starts to process each sentence sequentially and adds each extraction to the ontology by utilizing the logic reasoner.

3.3.3. Logic reasoner
After obtaining the base ontology and extractions from the essay, we can start discovering semantic errors. To achieve this, we use HermiT, the logic reasoner [55] (described in Section 2.5), as shown in Algorithm 1. If the ontology is inconsistent or has an unsatisfiable concept after adding an extraction, the system concludes that there is a semantic error in the essay. The system remembers where the error occurred to later provide detailed feedback. The system then deletes the relation from the ontology and continues with the next extraction. Whenever an extraction is processed, the system first looks for both entities (predicates) in the ontology. If the ontology does not yet contain any of them, the WordNet [65] taxonomy is used to find synonyms or coreferenced entities within the ontology. If the ontology still does not contain any synonyms or coreferences, the system looks for hypernyms of the entity and creates a subclass (i.e. creating a triplet with a subclassOf relation). If all described attempts fail, the last alternative is to create a new class or individual in the ontology (see Section 2.5). When both entities are included in the ontology, the system first checks if the specific relation is not yet a part of the ontology and adds it accordingly.

Algorithm 1. Automated error detection (AED) system.

Input: common sense ontology, domain knowledge, target knowledge, source text, ungraded essay
Output: error detections

function main(common_sense_ontology, domain_knowledge, target_knowledge, source_text, ungraded_essay)
    ontology <- common_sense_ontology
    (ontology, _) <- extract(source_text, ontology)
    ontology <- ontology + domain_knowledge
    ontology <- ontology + target_knowledge
    (_, errors) <- extract(ungraded_essay, ontology)
    return errors
end function

function extract(text, ontology)
    Preprocessing
    Entity recognition
    Coreference resolution
    Open information extraction
    for sentence in text do
        for relation in sentenceRelations do
            Add to ontology
            HermiT check
            print errors
        end for
    end for
    return (ontology, errors)
end function

3.3.4. Semantic error detection attributes
Based on the detections of semantic errors in an essay by the logic reasoner (as explained in the previous section), we implemented three error attributes (see the sketch after this list for the checking loop that produces them):

1. number of unsatisfiable cases when adding classes and individuals to the ontology,
2. number of inconsistency errors after adding a triple to the ontology,
3. total number of consistency errors (the sum of the first two attributes).
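The add-check-delete loop of Algorithm 1 can be sketched on top of owlready2, a Python OWL library whose sync_reasoner() call runs the HermiT reasoner [55]. The paper does not name this library, and the triple handling here is greatly simplified (no WordNet synonym or hypernym lookup), so treat this as a minimal illustration of the checking step only.

import types
from owlready2 import (Thing, get_ontology, sync_reasoner,
                       OwlReadyInconsistentOntologyError, destroy_entity)

onto = get_ontology("http://example.org/essay.owl")  # hypothetical base ontology

def check_triples(triples):
    """Add each <arg1, rel, arg2> to the ontology; collect semantic errors."""
    errors = []
    for arg1, rel, arg2 in triples:
        with onto:
            cls = types.new_class(arg2.capitalize(), (Thing,))  # class for arg2
            individual = cls(arg1)                              # arg1 isA arg2
        try:
            with onto:
                sync_reasoner()                # HermiT check
        except OwlReadyInconsistentOntologyError:
            errors.append((arg1, rel, arg2))   # remember where the error occurred
            destroy_entity(individual)         # delete and continue (Algorithm 1)
    return errors

# With no disjointness axioms these two triples pass; the counts of failures
# feed the three error attributes listed above.
print(check_triples([("Lisa", "isA", "girl"), ("Lisa", "isA", "student")]))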
Table 4. Properties of essay datasets (SD = standard deviation). Dataset 2 is scored in two domains, so two grade ranges and two mean grades are given for it.

                                | D1         | D2          | D3           | D4           | D5           | D6           | D7         | D8
Type of essay                   | Persuasive | Persuasive  | Source-based | Source-based | Source-based | Source-based | Expository | Narrative
Training: number of essays      | 1783       | 1800        | 1726         | 1771         | 1805         | 1800         | 1569       | 723
Training: mean number of words  | 366.40     | 381.19      | 108.69       | 94.39        | 122.29       | 153.64       | 171.28     | 622.13
Training: SD of number of words | 120.40     | 156.44      | 53.3         | 51.68        | 57.37        | 55.92        | 85.2       | 197.08
Training: range of grades       | 2–12       | 1–6 / 1–4   | 0–3          | 0–3          | 0–4          | 0–4          | 0–24       | 0–60
Training: mean grade            | 8.53       | 3.42 / 3.33 | 1.85         | 1.43         | 2.41         | 2.72         | 19.98      | 37.23
Test: number of essays          | 589        | 600         | 568          | 586          | 601          | 600          | 441        | 233
Test: mean number of words      | 368.96     | 378.4       | 113.24       | 98.7         | 127.17       | 152.28       | 173.48     | 639.05
Test: SD of number of words     | 117.99     | 156.82      | 56.0         | 53.84        | 57.59        | 52.81        | 84.52      | 190.13
Test: range of grades           | 2–12       | 1–6 / 1–4   | 0–3          | 0–3          | 0–4          | 0–4          | 0–24       | 0–60
Test: mean grade                | 8.62       | 3.41 / 3.32 | 1.9          | 1.51         | 2.51         | 2.75         | 20.13      | 36.67
4. Implementation and evaluation

We extracted the proposed attributes from the text by using the Natural Language Toolkit (NLTK) [68] for natural language processing in Python and the spellchecking library PyEnchant.7

To evaluate the scoring models, we used:

- the exact agreement measure, which is defined as the percentage of essays that were graded equally by the human grader and the AEE system,
- the quadratic weighted Kappa, which is an error metric that measures the degree of agreement between two graders (in the case of AEE this is the agreement between the automated scores and the resolved human scores) and is an analogy to the correlation coefficient. This metric typically ranges from 0 (random agreement between graders) to 1 (complete agreement between graders). In case there is less agreement between the graders than expected by chance, this metric may go below 0. Assuming that a set of essay responses E has S different possible ratings, 1, 2, ..., S, and that each essay received scores by two different graders (e.g. human/computer), the metric is calculated as follows (a computational sketch follows below):

\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}} \qquad (9)
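A sketch of Eq. (9) follows. The definitions of O, E and w were lost in extraction, so the standard ones are assumed here: O is the S x S histogram of observed score pairs, E is the outer product of the two graders' score distributions (chance agreement, scaled to the same total as O), and w is the quadratic penalty matrix. scikit-learn's cohen_kappa_score(..., weights="quadratic") computes the same quantity and can serve as a cross-check.

import numpy as np

def quadratic_weighted_kappa(human, machine, S):
    """Scores are integers in 1..S; returns kappa per Eq. (9)."""
    human, machine = np.asarray(human), np.asarray(machine)
    O = np.zeros((S, S))
    for h, m in zip(human, machine):                 # observed score pairs
        O[h - 1, m - 1] += 1
    E = np.outer(np.bincount(human - 1, minlength=S),
                 np.bincount(machine - 1, minlength=S))
    E = E / E.sum() * O.sum()                        # scale E to match O
    i, j = np.indices((S, S))
    w = (i - j) ** 2 / (S - 1) ** 2                  # quadratic weights w_ij
    return 1 - (w * O).sum() / (w * E).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 4, 4], S=4))  # ~0.92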
4.3. Prediction model

We experimented with linear regression, regression trees, a neural network, a random forest, and extremely randomized trees to predict the final grade. Table 5 shows the results of the Kappa metric for each classifier. Since the results showed that the random forest and extremely randomized trees [70] achieved the highest performance, we decided to use them as essay grade predictors in further evaluation. Their key properties and parameters were as follows:

- Random forest: we used the "randomForest" package9 in R; 100 trees; sampling of cases is done with replacement; the number of attributes randomly sampled as candidates at each split is |attributes|/3;
7 https://pythonhosted.org/pyenchant/.
8 Access to data can be requested through the Kaggle website http://www.kaggle.com/c/asap-aes/data or the ASAP website http://www.scoreright.org/.
9 http://cran.r-project.org/web/packages/randomForest/randomForest.pdf.
Table 5
Comparison of different regression models: linear regression (LR), regression trees (RT), neural network (NN), random forest
(RF), and extremely randomized trees (ERT).
Model DS1 DS2a DS2b DS3 DS4 DS5 DS6 DS7 DS8 Average
LR 0.8359 0.7232 0.5175 0.6535 0.7090 0.7900 0.7663 0.7781 0.7785 0.7280
RT 0.8070 0.6943 0.4885 0.6620 0.7113 0.7828 0.7184 0.7323 0.7289 0.7028
NN 0.8247 0.6964 0.4883 0.6328 0.6877 0.7776 0.7428 0.7601 0.7247 0.7039
RF 0.8447 0.7389 0.5386 0.6591 0.7174 0.7949 0.7636 0.7888 0.7738 0.7356
ERT 0.8434 0.7439 0.5384 0.6554 0.7148 0.7967 0.7670 0.7882 0.7807 0.7365
- Extremely randomized trees: a model similar to random forest, but it uses the same data to train all trees in a set and chooses splitting nodes randomly among variables. We used the "extraTrees" package10 in R; the ensemble contained 100 trees; the number of attributes tried at each node was |attributes|/3; the number of random cuts for each (randomly chosen) attribute was 1 (default), which corresponds to the official ExtraTrees method; cutting thresholds are uniformly sampled.

Since both models continued to achieve similar results in the following experiments, we report only the results for the random forest model.
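The paper fits the two models with R's "randomForest" and "extraTrees" packages; the scikit-learn configuration below is an approximately equivalent restatement, shown only to make the stated hyper-parameters concrete, and is not the authors' code.

from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

random_forest = RandomForestRegressor(
    n_estimators=100,    # 100 trees
    max_features=1 / 3,  # |attributes|/3 candidate attributes at each split
    bootstrap=True,      # sampling of cases with replacement
)
extra_trees = ExtraTreesRegressor(
    n_estimators=100,
    max_features=1 / 3,
    bootstrap=False,     # the same data is used to train all trees
)
# Usage: model.fit(X_train, y_train); grades = model.predict(X_test)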
attributes alone also achieved relatively high prediction accuracy,
5. Results which enabled us to conclude in favor of their benefit.
We additionally also calculated the Spearman coefficients be-
To analyze the potential benefits of the proposed attributes, we tween the coherence attributes and the organization rubric score.
first evaluated their relevance and contribution to predictive accu- Getis’s G and Moran’s I achieved the highest absolute correlations
racy. We proceed by comparing predictive accuracy of three dif- with 0.5947 and 0.5752, respectively (p-values < 0.001). Overall 19
ferent versions of our AEE system and continue by comparing the of 29 coherence attributes correlate with the organization rubric
best system to other state-of-the-art systems. score with p-value smaller than 0.05.
5.1. Evaluation of the implemented attributes 5.2. Accuracy of the semantic-based AEE system
As described in Section 3, we implemented 104 existing and In our third experiment, we compared three versions of our
novel attributes (72 linguistic and content attributes, 29 coherence system to evaluate if semantic attributes yield to better model per-
attributes, and 3 consistency attributes). To improve model inter- formance:
pretability, achieve shorter training times and enhance generaliza-
tion by reducing overfitting, we performed attribute selection to 1. AGE: The system with only linguistic and content attributes
detect redundant and irrelevant attributes. Attribute selection was (described in Section 3.1),
performed using the forward attribute selection approach. Start- 2. AGE+: system AGE, augmented with additional coherence at-
ing with an empty set of attributes, an attribute that improves the tributes (described in Section 3.2),
model performance the most (measured by the quadratic weighted 3. SAGE: system AGE+, augmented with additional consistency at-
Kappa measure) was included into the set in each iterative step. tributes (described in Section 3.3).
The procedure was terminated when there were no more at-
Since the system SAGE requires source-based essays to build an
tributes that improved the model performance. The model perfor-
ontology for the logic reasoner, we were able to evaluate it only
mance was measured using the internal ten-fold cross-validation,
on datasets that include source-based essays (such datasets are 3,
and the best attribute in each step was selected according to the
4, 5, and 6).
highest average Kappa value among all folds.
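The forward selection procedure just described can be sketched as a greedy loop; the model, fold count and Kappa scoring follow the text, while rounding the regressor's predictions before scoring is our assumption.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold

def cv_kappa(X, y, cols):
    """Average quadratic weighted Kappa over an internal ten-fold CV."""
    kappas = []
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = RandomForestRegressor(n_estimators=100).fit(X[tr][:, cols], y[tr])
        pred = np.rint(model.predict(X[te][:, cols])).astype(int)
        kappas.append(cohen_kappa_score(y[te], pred, weights="quadratic"))
    return np.mean(kappas)

def forward_selection(X, y):
    selected, best = [], -np.inf
    while True:
        scores = {c: cv_kappa(X, y, selected + [c])
                  for c in range(X.shape[1]) if c not in selected}
        col, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:          # no attribute improves the model: stop
            return selected
        selected, best = selected + [col], score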
The ranks of the 50 most relevant attributes, averaged across all data sets, are shown in Table 6 in decreasing order of the average rank. From the ranking we can see that the numbers of characters and words influence the final grade most, as well as the score point level that uses cosine similarity between already graded essays and a new essay. We can observe that some of the proposed coherence attributes rank high among other attributes: Getis's G (13th), determinant of distance matrix (16th), minimum distance between neighboring points (Euclidean) (21st), relative distance (22nd), Clark's and Evans' distance to nearest neighbor (23rd), and Moran's I (24th). The highest ranked proposed content attribute – the sum of consistency errors – is in 37th place.

Table 6. Average ranks of the 50 most relevant attributes within all 104 attributes. The ranks were obtained using the forward attribute selection.

Dataset 8, in addition to the final score, provides 6 rubric scores describing ideas and content, organization, voice, word choice, sentence fluency, and convention for each essay. Since the organization rubric also describes the coherence, we decided to further investigate how well our proposed coherence attributes predict the organization rubric score. In this experiment, we prepared datasets with three different sets of attributes: (a) coherence attributes only (29), (b) syntax (linguistic and content) attributes only (72), and (c) syntax and coherence attributes (101). Table 7 shows the results using quadratic weighted Kappa and p-values. Based on the high influence of the number of characters and words on the final grade, we expected high prediction accuracy already by using the set of syntax attributes only. Nevertheless, by adding the set of coherence attributes to the set of syntax attributes, the accuracy additionally increased. We can also see that the set of coherence attributes alone achieved relatively high prediction accuracy, which enabled us to conclude in favor of their benefit.

We additionally calculated the Spearman coefficients between the coherence attributes and the organization rubric score. Getis's G and Moran's I achieved the highest absolute correlations with 0.5947 and 0.5752, respectively (p-values < 0.001). Overall, 19 of the 29 coherence attributes correlate with the organization rubric score with a p-value smaller than 0.05.

5.2. Accuracy of the semantic-based AEE system

In our third experiment, we compared three versions of our system to evaluate whether semantic attributes yield better model performance:

1. AGE: the system with only linguistic and content attributes (described in Section 3.1),
2. AGE+: system AGE, augmented with additional coherence attributes (described in Section 3.2),
3. SAGE: system AGE+, augmented with additional consistency attributes (described in Section 3.3).

Since the system SAGE requires source-based essays to build an ontology for the logic reasoner, we were able to evaluate it only on datasets that include source-based essays (such datasets are 3, 4, 5, and 6).

Table 8 shows the quadratic weighted Kappas and exact agreement for AGE and AGE+. The results show that the prediction accuracy significantly (p-value < 0.05) improves on 8 out of 9 datasets when the coherence attributes are used in the system. The comparison of the average results in the rightmost column of Table 8 shows that there is also a significant difference between both systems over all datasets.

Table 9 shows that the consistency attributes helped achieve higher Kappa values on all four observed datasets. However, the improvements were significant only on two out of four datasets, and not on the average (the rightmost column). We nevertheless argue (in Section 6) that SAGE contributes another valuable benefit – feedback for students.

5.3. Comparison with the state-of-the-art AEE systems
We compared SAGE with the systems evaluated in the comparison study of 2012: PEG, e-rater, IntelliMetric, CRASE, LightSIDE, AutoScore, IEA, Bookette, Lexile Writing Analyzer, with a ranked-based approach [28], and with results obtained by researchers participating in the before-mentioned Automated Essay Scoring competition on the Kaggle website. The eight commercial systems among the above listed systems capture over 97% of the current automated scoring market in the USA [26].
Table 8. Comparison of the system AGE (syntactic attributes only) and the system AGE+ (with additional coherence attributes) using quadratic weighted Kappa (1st row) and exact agreement (2nd row); p-values are computed for Kappas.

System            | DS1      | DS2a     | DS2b    | DS3     | DS4     | DS5     | DS6      | DS7     | DS8    | Average
AGE, QW Kappa     | 0.9045   | 0.7473   | 0.6619  | 0.8096  | 0.8040  | 0.8701  | 0.7736   | 0.8760  | 0.7851 | 0.8036
AGE, exact agr.   | 0.7224   | 0.7716   | 0.7379  | 0.7886  | 0.7237  | 0.7847  | 0.7314   | 0.2607  | 0.2219 | 0.6381
AGE+, QW Kappa    | 0.9251   | 0.7924   | 0.6714  | 0.8272  | 0.8109  | 0.8729  | 0.7817   | 0.8814  | 0.8050 | 0.8187
AGE+, exact agr.  | 0.7507   | 0.8057   | 0.7481  | 0.8036  | 0.7375  | 0.7805  | 0.7400   | 0.2627  | 0.1577 | 0.6430
p-value           | < 0.001* | < 0.001* | 0.0416* | 0.0116* | 0.0398* | 0.0205* | < 0.001* | 0.0201* | 0.0570 | 0.0083*

* p-value < 0.05
Table 9. Comparison of the systems AGE (syntactic attributes only), AGE+ (syntactic and coherence attributes) and SAGE (syntactic, coherence and consistency attributes) on source-based datasets using quadratic weighted Kappa (1st row) and exact agreement (2nd row); p-values are computed for Kappas.
Table 10. Comparison of the proposed semantic grading system SAGE with other state-of-the-art systems. The table shows quadratic weighted Kappas achieved on different datasets. Significant differences (p < 0.05) are marked with an asterisk.

System        | DS1   | DS2a  | DS2b  | DS3   | DS4   | DS5   | DS6   | DS7   | DS8   | Average
SAGE          | 0.93  | 0.79  | 0.67  | 0.83  | 0.81  | 0.89  | 0.79  | 0.88  | 0.81  | 0.83
PEG           | 0.82* | 0.72* | 0.70  | 0.75* | 0.82  | 0.83* | 0.81  | 0.84* | 0.73  | 0.79
e-rater       | 0.82* | 0.74* | 0.69  | 0.72* | 0.80  | 0.81* | 0.75  | 0.81* | 0.70* | 0.77*
IntelliMetric | 0.78* | 0.70* | 0.68  | 0.73* | 0.79  | 0.83* | 0.76  | 0.81* | 0.68* | 0.76*
CRASE         | 0.76* | 0.72* | 0.69  | 0.73* | 0.76* | 0.78* | 0.78  | 0.80* | 0.68* | 0.75*
LightSIDE     | 0.79* | 0.70* | 0.63  | 0.74* | 0.81  | 0.81* | 0.76  | 0.77* | 0.65* | 0.75*
Ranked-based  | 0.81* | 0.68* | 0.68  | 0.67* | 0.73* | 0.80* | 0.72* | 0.77* | 0.71* | 0.74*
AutoScore     | 0.78* | 0.68* | 0.66  | 0.72* | 0.75* | 0.82* | 0.76  | 0.67* | 0.69* | 0.73*
IEA           | 0.79* | 0.70* | 0.65  | 0.65* | 0.74* | 0.80* | 0.75  | 0.77* | 0.69* | 0.73*
Bookette      | 0.70* | 0.68* | 0.63  | 0.69* | 0.76* | 0.80* | 0.64* | 0.74* | 0.60* | 0.70*
Lexile        | 0.66* | 0.62* | 0.55* | 0.65* | 0.67* | 0.64* | 0.65* | 0.58* | 0.63* | 0.63*

* p-value < 0.05
Tables 10 and 11 show the results that were calculated between the automated and human scores (the resolved score of multiple human graders). Since not all the systems are available for public experimenting, their results were obtained from the papers [26] and [28], and from the Kaggle website.11 Results reported in [26] and [28] include Kappa values for every data set and are reported in Table 10 together with the results of our system. The evaluated systems are sorted in descending order of the average Kappa value, which is shown in the rightmost column of Table 10. Since dataset 2 has scores in two different domains, each transformed Kappa is weighted by 0.5. The Wilcoxon non-parametric test was used to compute p-values that express the significance of differences between each evaluated system and the SAGE system. We can see that our system achieves significantly better results on 5 out of 9 datasets (DS1, DS2a, DS3, DS5, DS7). On the remaining four datasets, the accuracy of SAGE was statistically insignificantly different from the accuracy of the best performing system, while still significantly better compared with some of the systems. On the average (the rightmost column), SAGE achieved significantly better results than 9 out of 10 other systems. We have also compared SAGE with the results obtained from the leader board of the Automated Essay Scoring competition. In Table 11 we ranked the 8 commercial systems, the 8 leading systems from the competition, LightSide [32], the ranked-based system [28] and SAGE. The results are reported in the form of the average Kappas over all datasets, since the accuracy of the 8 leading systems on the Kaggle website is reported like that.

Table 11. Accuracy comparison of various systems from the literature and results from the Kaggle competition.

System                                  | Avg. acc. | Rank
SAGE                                    | 0.8325    | 1
Sollers & Gxava(a)                      | 0.8014    | 2
SirGuessalot & PlanetThanet & Stefan(a) | 0.7986    | 3
VikP & jman(a)                          | 0.7978    | 4
Efimov+Berengueres(a)                   | 0.7956    | 5
@ORGANIZATION(a)                        | 0.7947    | 6
PEG [21]                                | 0.7888    | 7
Martin(a)                               | 0.7857    | 8
cs224u(a)                               | 0.7828    | 9
jackpot (Jason)(a)                      | 0.7826    | 10
e-rater [16]                            | 0.7656    | 11
IntelliMetric [18]                      | 0.7588    | 12
CRASE [25]                              | 0.7494    | 13
LightSIDE [32]                          | 0.7494    | 14
Ranked-based [28]                       | 0.7363    | 15
AutoScore [26]                          | 0.7325    | 16
IEA [17]                                | 0.7344    | 17
Bookette [23]                           | 0.6981    | 18
Lexile [27]                             | 0.6331    | 19

(a) Results were obtained from the leader board of the AES competition on the Kaggle website.11

11 http://www.kaggle.com/c/asap-aes/data.

6. Semantic feedback

more equal relations with different classes (e.g. like or isA relations), as long as these classes are not disjoint (e.g. Lisa can be a girl and a student).
that. a student).
Fig. 6. The system detects disjointness of hypernyms and reports an error with a reference to a relation in the ontology that contradicts the extraction from a written
sentence.
K. Zupanc, Z. Bosnić / Knowledge-Based Systems 120 (2017) 118–132 131
also helps students become more autonomous during their learn- [19] E. Mayfield, C. Penstein-Rosé, An interactive tool for supporting error analysis
ing process. Systems with feedback can be an aid, not a replace- for text mining, in: Proceedings of the NAACL HLT 2010 Demonstration Ses-
sion, 2010, pp. 25–28. Los Angeles, CA
ment, for classroom instruction and can help students to achieve [20] Y. Chali, S.A. Hasan, On the effectiveness of using syntactic and shallow
progress faster. Students can use SAGE in the classroom as well semantic tree kernels for automatic assessment of essays, in: Proceedings
as at home, while learning. Feedback for each specific response re- of the International Joint Conference on Natural Language Processing, 2013,
pp. 767–773. Nagoya, Japan
turned by our system provides information on the quality of differ- [21] E.B. Page, Computer grading of student prose , using modern concepts and
ent aspects of writing, a score and a descriptive feedback. The sys- software, J. Exp. Educ. 62 (2) (1994) 127–142.
tem’s constant availability for scoring gives a possibility to students [22] O. Mason, I. Grove-Stephenson, Automated free text marking with paperless
school, in: Proceedings of the Sixth International Computer Assisted Assess-
to repetitively practice their writing at any time. SAGE is consistent
ment Conference, 2002, pp. 213–219.
as it predicts the same score for a single essay each time that essay [23] C.S. Rich, M.C. Schneider, J.M. D’Brot, Applications of automated essay evalua-
is input to the system. This is important since the scoring consis- tion in West Virginia, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Auto-
mated Essay Evaluation: Current Applications and New Directions, Routledge,
tency between prompts turned out to be one of the most difficult
New York, 2013, pp. 99–123.
psychometric issues in human scoring [13]. [24] A. Fazal, T. Dillon, E. Chang, Noise reduction in essay datasets for automated
Advantages of automated feedback are its anonymity, instanta- essay grading, Lect. Notes Comput. Sci. 7046 (2011) 484–493.
neousness, and encouragement for repetitive improvements by giv- [25] S.M. Lottridge, E.M. Schulz, H.C. Mitzel, Using automated scoring to monitor
reader performance and detect reader drift in essay scoring., in: M.D. Shermis,
ing students more practice in writing essays [72]. By publicly pro- J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applica-
viding the technical details and results of our AEE system, we also tions and New Directions, Routledge, New York, 2013, pp. 233–250.
aim to promote the openness of this research field. Hopefully, this [26] M.D. Shermis, B. Hamner, Contrasting state-of-the-art automated scoring of es-
says: analysis, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Es-
shall open opportunities to progress and help bring more AEE sys- say Evaluation: Current Applications and New Directions, Routledge, New York,
tems into practical applications. 2013, pp. 313–346.
[27] M.I. Smith, A. Schiano, E. Lattanzio, Beyond the classroom., Knowl. Quest 42
(3) (2014) 20–29.
References [28] H. Chen, B. He, T. Luo, B. Li, A ranked-based learning approach to automated
essay scoring, in: Proceedings of the Second International Conference on Cloud
[1] M.D. Shermis, J. Burstein, Introduction, in: M.D. Shermis, J. Burstein (Eds.), Au- and Green Computing, Ieee, 2012, pp. 448–455.
tomated essay scoring: a cross-disciplinary perspective, Lawrence Erlbaum As- [29] L. Bin, Y. Jian-Min, Automated essay scoring using multi-classifier fusion, Com-
sociates, Manwah, NJ, 2003, pp. xiii–xvi. mun. Comput. Inf. Sci. 233 (2011) 151–157.
[2] E.B. Page, The imminence of...grading essays by computer, Phi Delta Kappan 47 [30] L.M. Rudner, T. Liang, Automated essay scoring using Bayes’ theorem, J. Tech-
(5) (1966) 238–243. nol. Learn. Assess. 1 (2) (2002) 3–21.
[3] T.K. Landauer, P.W. Foltz, D. Laham, An introduction to latent semantic analysis, [31] J.R. Christie, Automated essay marking - for both style and content, in: Pro-
Discourse Process. 25 (2–3) (1998) 259–284. ceedings of the Third Annual Computer Assisted Assessment Conference, 1999.
[4] T. Kakkonen, N. Myller, E. Sutinen, J. Timonen, Comparison of dimension re- [32] E. Mayfield, C. Rosé, LightSIDE: open source machine learning for text, in:
duction methods for automated essay grading, Educ. Technol. & Soc. 11 (3) M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay Evaluation:
(2008) 275–288. Current Applications and New Directions, Routledge, New York, 2013,
[5] Y. Attali, A Differential Word Use Measure for Content Analysis in Automated pp. 124–135.
Essay Scoring, ETS Research Report Series, 36, 2011. [33] M.M. Islam, A.S.M.L. Hoque, Automated essay scoring using generalized latent
[6] P.W. Foltz, W. Kintsch, T.K. Landauer, The measurement of textual semantic analysis, J. Comput. 7 (3) (2012) 616–626.
coherence with latent semantic analysis, Discourse Process. 25 (2–3) (1998) [34] R. Williams, H. Dreher, Automatically grading essays with markitÂl’, Issues Inf.
285–307. Sci. Inf. Technol. 1 (2004) 693–700.
[7] P.W. Foltz, Discourse coherence and LSA, in: T.K. Landauer, D.S. McNamara, [35] M.A. Hearst, TextTiling: segmenting text into multi-paragraph subtopic pas-
S. Dennis, W. Kintsch (Eds.), Handbook of Latent Semantic Analysis, Lawrence sages, Comput. Ling. 23 (1) (1997) 33–64.
Erlbaum Associates, Inc., Mahwah, New Jersey, 2007, pp. 167–184. [36] E. Miltsakaki, K. Kukich, Automated evaluation of coherence in student essays,
[8] D. Higgins, J. Burstein, D. Marcu, C. Gentile, Evaluating multiple aspects of co- in: Proceedings of LREC-20 0 0, Linguistic Resources in Education Conf., Athens,
herence in student essays, in: Proceedings of HLT-NAACL, 2004. Boston, MA Greece, 20 0 0, 20 0 0, pp. 140–147.
[9] J. Burstein, J. Tetreault, S. Andreyev, Using entity-based features to model co- [37] B.J. Grosz, A.K. Joshi, S. Weinstein, Centering : a framework for modelling the
herence in student essays, in: Human Language Technologies: The 2010 Annual local coherence of discourse, Comput. Ling. 21 (2) (1995) 203–226.
Conference of the North American Chapter of the ACL, Association for Compu- [38] J.C. Burstein, J.R. Tetreault, M. Chodorow, D. Blanchard, S. Andreyev, Automated
tational Linguistics, Los Angeles, California, 2010, pp. 681–684. evaluation of discourse coherence quality in essay writing, in: M.D. Shermis,
[10] F. Gutiererz, D. Dou, S. Fickas, G. Griffiths, Online reasoning for ontology-based J.C. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Appli-
error detection in text, in: On the Move to Meaningful Internet Systems: cations and New Directions, Routledge, New York, 2013, pp. 267–280.
OTM 2014 Conferences Lecture Notes in Computer Science, 8841, 2014, [39] R. Barzilay, M. Lapata, Modeling local coherence : an entity-based approach,
pp. 562–579. Comput. Ling. 34 (1) (2008) 1–34.
[11] E. Brent, C. Atkisson, N. Green, Time-shifted collaboration: creating teach- [40] F. Gutierrez, D.C. Wimalasuriya, D. Dou, Using information extractors with the
able moments through automated grading, in: A. Juan, T. Daradournis, S. Ca- neural electromagnetic ontologies, in: Proceedings of the 2011th Confeder-
balle (Eds.), Monitoring and Assessment in Online Collaborative Environments: ated International Conference on On the Move to Meaningful Internet Systems
Emergent Computational Technologies for E-learning Support, IGI Global, 2010, (OTM’11), 2011, pp. 31–32.
pp. 55–73. [41] F. Gutierrez, D. Dou, S. Fickas, G. Griffiths, Providing grades and feedback for
[12] I.I. Bejar, A validity-based approach to quality control and assurance of auto- student summaries by ontology-based information extraction, in: Proceedings
mated scoring, Assess. Educ. 18 (3) (2011) 319–341. of the 21st ACM International Conference on Information and Knowledge Man-
[13] Y. Attali, Validity and reliability of automated essay scoring, in: M.D. Shermis, agement - CIKM ’12, 2012, pp. 1722–1726.
J.C. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Appli- [42] F. Gutierrez, D. Dou, A. Martini, S. Fickas, H. Zong, Hybrid ontology-based
cations and New Directions, Routledge, New York, 2013, pp. 181–198. information extraction for automated text grading, in: Proceedings of 12th
[14] D.M. Williamson, X. Xi, F.J. Breyer, A framework for evaluation and use of au- International Conference on Machine Learning and Applications, 2013,
tomated scoring, Educ. Meas. 31 (1) (2012) 2–13. pp. 359–364.
[15] M.D. Shermis, J. Burstein, S.A. Bursky, Introduction to automated essay evalua- [43] D.C. Wimalasuriya, D. Dou, Ontology-based information extraction: an intro-
tion, in: M.D. Shermis, J. Burstein, S.A. Bursky (Eds.), Handbook of Automated duction and a survey of current approaches, J. Inf. Sci. 36 (3) (2010) 306–323.
Essay Evaluation: Current Applications and New Directions, Routledge, New [44] F. Wu, D.S. Weld, Open information extraction using wikipedia, in: Proceedings
York, 2013, pp. 1–15. of the 48th Annual Meeting of the Association for Computational Linguistics,
[16] J. Burstein, J. Tetreault, N. Madnani, The E-raterÂő automated essay scoring 2010, pp. 118–127.
system, in: M.D. Shermis, J. Burstein (Eds.), Handbook of Automated Essay [45] P. Gamallo, An overview of open information extraction, in: M.J.A.V. Pereira,
Evaluation: Current Applications and New Directions, Routledge, New York, J.P. Lea, A. Simões (Eds.), Invited Talk at the 3rd Symposium on Languages,
2013, pp. 55–67. Applications and Technologies, SLATE14., 2014, pp. 13–16.
[17] P.W. Foltz, L.A. Streeter, K.E. Lochbaum, T.K. Landauer, Implementation and ap- [46] https://www.google.com/patents/US20140032209.
plications of the intelligent essay assessor, in: M.D. Shermis, J. Burstein (Eds.), [47] L.D. Corro, R. Gemulla, ClausIE: clause-based open information extraction, in:
Handbook of Automated Essay Evaluation: Current Applications and New Di- Proceedings of the 22nd International Conference on World Wide Web, Rio de
rections, Routledge, New York, 2013, pp. 68–88. Janeiro, Brazil, 2013, pp. 355–366.
[18] M.T. Schultz, The IntelliMetric automated essay scoring engine - a review and [48] H. Bast, E. Haussmann, Open information extraction via contextual sentence
an application to chinese essay scoring, in: M.D. Shermis, J.C. Burstein (Eds.), decomposition, in: Proceedings of the 7th International Conference on Seman-
Handbook of Automated Essay Evaluation: Current Applications and New Di- tic Computing (ICSC), Irvine, California, 2013, pp. 154–159.
rections, Routledge, New York, 2013, pp. 89–98.
132 K. Zupanc, Z. Bosnić / Knowledge-Based Systems 120 (2017) 118–132
[49] P. Gamallo, M. Garcia, S. Fernandez-Lanza, Dependency-based open informa- [62] R.C. Geary, The contiguity ratio and statistical mapping, Incorporated Stat. 5
tion extraction, in: Proceedings of the Joint Workshop on Unsupervised and (3) (1954) 115–145.
Semi-Supervised Learning in NLP (), Avignon, France, 2012, pp. 10–18. [63] A. Getis, J.K. Ord, The analysis of spatial association by use of distance statis-
[50] V. Punyakanok, D. Roth, The use of classifiers in sequential inference, in: Neu- tics, Geogr. Anal. 24 (3) (1992) 189–206.
ral Information Processing Systems 2001, Vancouver, British Columbia, 2001, [64] P. Cassidy, Toward an open-source foundation ontology representing the Long-
pp. 995–1001. man’s defining vocabulary: the COSMO ontology OWL version, in: Proceedings
[51] E. Bengtson, D. Roth, Understanding the value of features for coreference reso- of the Third International Ontology for the Intelligence Community Conference,
lution, in: Proceedings of the Conference on Empirical Methods in Natural Lan- Fairfax, VA, 2009.
guage Processing - EMNLP ’08, Waikiki, Honolulu, Hawaii, 2008, pp. 294–303. [65] G.A. Miller, Wordnet: a lexical database for english, Commun. ACM 38 (11)
[52] D. Chen, C.D. Manning, A fast and accurate dependency parser using neural (1995) 39–41.
networks, in: Proceedings of the 2014 Conference on Empirical Methods in [66] H. Peng, K.-W. Chang, D. Roth, A joint framework for coreference resolution
Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 740–750. and mention head detection, in: Proceedings of the 19th Conference on Com-
[53] T. Gruber, Ontology, in: L. Liu, M.T. Ozsu (Eds.), Encyclopedia of Database Sys- putational Natural Language Learning (CoNLL’15), Association for Computa-
tems, Springer-Verlag, 2009. tional Linguistics, Beijing, China, 2015, pp. 12–21.
[54] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D.L. McGuinness, [67] C.D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S.J. Bethard, D. McClosky, The
P.F. Patel-Schneider, L.A. Stein, OWL web ontology language, Technical Report, stanford CoreNLP natural language processing toolkit, in: Proceedings of 52nd
2004. Annual Meeting of the ACL: System Demonstrations, Association for Computa-
[55] B. Motik, R. Shearer, I. Horrocks, Hypertableau reasoning for description logics, tional Linguistics, Baltimore, Maryland, 2014, pp. 55–60.
J. Artif. Intell. Res. 36 (2009) 165–228. [68] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O’Reilly
[56] L. Perelman, When “the state of the art” is counting words, Assess. Writing 21 Media, 2009.
(2014) 104–111. [69] G.K. Kanji, 100 Statistical Tests, 3rd ed., SAGE Publications, London, Thousand
[57] W.H. Dubay, Smart Language: Readers , Readability , and the Grading of Text, Oaks, New Delhi, 2006.
BookSurge Publishing, 2007. [70] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Mach. Learn. 63
[58] C. Smith, A. Jönsson, Automatic summarization as means of simplifying texts, (1) (2006) 3–42.
an evaluation for swedish, in: B. Sandford Pedersen, G. Nešpore, I. Skadina [71] M.D. Shermis, H.R. Mzumara, J. Olson, S. Harrington, On-line grading of student
(Eds.), Proceedings of the 18th Nordic Conference of Computational Linguistics essays: PEG goes on the world wide web, Assess. Eval. Higher Educ. 26 (3)
NODALIDA 2011, 2011, pp. 198–205. (2001) 247–259.
[59] A. Mellor, Essay length , lexical diversity and automatic essay scoring, Mem. [72] S.C. Weigle, English as a second language writing and automated essay evalua-
Osaka Inst. Technol. 55 (2) (2011) 1–14. tion, in: M.D. Shermis, J.C. Burstein (Eds.), Handbook of Automated Essay Eval-
[60] P.J. Clark, F.C. Evans, Distance to nearest neighbor as a measure of spatial rela- uation: Current Applications and New Directions, Routledge, New York, 2013,
tionships in populations, Ecology 35 (4) (1954) 445–453. pp. 36–54.
[61] P.A.P. Moran, Notes on continuous stochastic phenomena, Biometrika 37 (1–2)
(1950) 17–23.