
Chapter 4

An Improved Approach for Automated Essay Scoring with LSTM and Word Embedding

Dadi Ramesh and Suresh Kumar Sanampudi

D. Ramesh
Research Scholar, JNTU Hyderabad; School of Computer Science and Artificial Intelligence, SR University, Warangal, India

S. K. Sanampudi
JNTUH College of Engineering Jagitial, Nachupally, Jagtial district, Telangana, India
e-mail: sureshsanampudi@jntuh.ac.in

Abstract Automatic essay scoring (AES) has been shown to be an effective mechanism for quickly assessing student responses in the education system. It already has a wide variety of applications, but many systems evaluate essays using statistical features such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), while other approaches consider features such as word embeddings with GloVe, Word2Vec, or one-hot encoding. Neither type of approach fully addresses essay evaluation, as neither retrieves the semantic information in essays. Here we evaluate essays with Word2Vec and a Long Short-Term Memory (LSTM) network under K-fold cross-validation and obtain a quadratic weighted Kappa score of 0.8535.

4.1 Introduction

Manually scoring student responses such as essays and long answers is a daunting task for evaluators in the education system. In extensive exams where the number of students exceeds several thousand, manual scoring requires a lot of time and offers no guarantee of reliability. Automated essay scoring (AES) systems evolved to assess student responses on a large scale, and computers have thereby reduced the human effort in the assessment process. Advances in artificial intelligence and natural language processing have improved progress in AES, and much research has concentrated on improving the accuracy and reliability of these automated systems.
Early AES systems were implemented on statistical features such as the number of words, the number of sentences, sentence length, and the average sentence length to prepare a vector and train a machine learning model.

D. Ramesh (B)
Research Scholar in JNTU Hyderabad, School of Computer Science and Artificial Intelligence,
SR Univrsity, Warangal, India
S. K. Sanampudi
JNTUH College of Engineering Jagitial, Nachupally Jagtial dist, Telangana, India
e-mail: sureshsanampudi@jntuh.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 35
V. Bhateja et al. (eds.), Evolution in Computational Intelligence,
Smart Innovation, Systems and Technologies 267,
https://doi.org/10.1007/978-981-16-6616-2_4
36 D. Ramesh and S. K. Sanampudi

Later, researchers tried to retrieve style-based features such as parts of speech and grammar for vector preparation. Over the last decade, 2010–2020, researchers focused on content-based features such as coherence, cohesion, and completeness to assess student responses. These features broadly capture the student's writing style and how well the content explains the prompt.
There are two major challenges in AES: the first is preparing the feature vector from the essay set, and the second is training the feature vector with a proper machine learning model. Preparing feature vectors that capture semantic information is a significant challenge [1]. Researchers have used many types of text embedding (vector) methods [2, 3], but none fully captures an essay's semantics. The embedding techniques range from simple counts over the essay and Bag of Words [4] to word frequency vectors such as TF-IDF [4], Word2Vec [5], and GloVe, but no technique prepares content-based feature vectors.
Among machine learning models [6–8], researchers have used regression models, support vector machines, and random forests, but these are not sequence-to-sequence learners and cannot capture an essay's coherence. Among deep learning models, convolutional neural networks (CNNs) likewise do not learn sequence to sequence, though they are useful for N-gram feature extraction, whereas recurrent neural networks (RNNs) are sequence-to-sequence learning models that can assess the student response.
The rest of the paper is organized as follows: Sect. 4.2 reviews related work, Sect. 4.3 describes our approach to AES, Sect. 4.4 presents the results and analysis, and Sect. 4.5 concludes.

4.2 Related Work

The first AES system, Project Essay Grade, developed by Ajay et al. [9], extracts features such as character count, word count, and sentence length to grade essays. Foltz et al. [10] introduced the Intelligent Essay Assessor (IEA), which evaluates content using latent semantic analysis to produce an overall score. These systems failed to retrieve content-based features from essays, as the models were trained on statistical features.
Dasgupta et al. [11] implemented a sequence-to-sequence learning model to train feature vectors. They used GloVe to prepare the feature vectors, which are passed to a CNN layer to retrieve local features. After the CNN, they stacked an RNN layer for the actual sequence-to-sequence training, with an activation layer on top to predict the score of the essay.
Wang et al. [12] implemented a bi-LSTM to train feature vectors, preparing the features by word-level encoding with the Word2Vec [5] library, which builds word vectors starting from one-hot encodings.
Kumar et al. [13] implemented AutoSAS for short-answer scoring. They stacked CNN and RNN layers in AutoSAS and retrieved various features such as context, POS, and prompt overlap; based on these features, they classified whether or not to recommend an essay. They assigned a score to each individual feature and, with that, classified the essay.
Liu et al. [14] developed a two-stage learning framework for essay scoring. In the first stage, they computed an essay's semantic score, coherence score, and prompt-relevance score: the semantic score was obtained with BERT, a word-level encoding model, while the coherence and prompt-relevance scores were calculated with an LSTM. In the second stage, they concatenated all three scores and trained an XGBoost model to assign a score to the essay.
Darwish et al. [15] predicted essay scores based on syntax and semantic features: the syntax features were found with lexical analysis and parsing, the semantic features were predicted with a TF-IDF vector, and together these predicted the final score of the essay. Zhu and Sun [16] used GloVe for word embedding, trained an LSTM model to give final scores, and also retrieved some statistical features for the essay's final score. Wang et al. (2020) implemented both regression and classification layers in a Bi-LSTM for essay scoring and classified the essays based on the ground-truth value.
Uto and Okano [17] implemented a stacked deep neural network with CNN and RNN layers. They extracted features with a lookup-table layer that converts all words to vectors, pulled N-gram-level features with a zero-padded CNN, and finally trained an LSTM with a sigmoid activation function to produce the final score.

4.3 Method

We propose a single-dimensional LSTM [18] with K-fold cross-validation on the ASAP dataset. An AES system should assess an essay based on its content and its relevance to the prompt. For that, we use the LSTM [18] recurrent neural network to train on essay word vectors. LSTM [18] is a sequence-to-sequence learning model that learns complex patterns from vectors. The word vectors are prepared with the Word2Vec [5] NLP library, which converts essay words into vectors of a minimum of 32 dimensions.

4.3.1 Data Set

We used the publicly available ASAP dataset from the Kaggle competition, which is the largest corpus for AES and has been used by the largest number of researchers. It contains eight prompts, and each prompt has, on average, above 1500 essays. Human raters score each essay on a prompt-specific scale that falls between 0 and 60. A detailed description of the essay dataset is given in Table 4.1.

Table 4.1 ASAP Kaggle dataset

Essay set   No. of essays   Average length of essays   Rating range
1           1783            350                        2–12
2           1800            350                        1–6
3           1726            150                        0–3
4           1772            150                        0–3
5           1805            150                        0–4
6           1800            150                        0–4
7           1569            250                        0–30
8           723             650                        0–60
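For readers who want to reproduce this setup, the following is a minimal sketch of loading the ASAP data. The file name training_set_rel3.tsv and the column names come from the Kaggle competition release; the latin-1 encoding is an assumption that commonly applies to this file.

```python
# Minimal sketch: load the ASAP training file from the Kaggle release.
# Adjust the path to your local copy of the dataset.
import pandas as pd

data = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

# Keep only the columns needed for scoring: prompt id, essay text,
# and the resolved human score.
essays = data[["essay_id", "essay_set", "essay", "domain1_score"]]

# Essays per prompt, roughly matching the counts in Table 4.1.
print(essays.groupby("essay_set").size())
```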

4.3.2 Proposed Model

Typically, AES system will have two components, namely, essay representation into
a vector, essay scoring. For vector representation of essays, first divided the entire
essay into sentence-wise and removed stop words with NLTK python library’s help
from the essay. From that, we prepared a list with all bow of each essay. Later
converted bow to vector form with Word2Vec [5] NLP library. Word2Vec converts
the words into a vector with a minimum of 32 dimensions for each word. Figure 4.1
will illustrate the complete AES scoring approach with LSTM [18].

Fig. 4.1 Proposed approach
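The preprocessing just described can be sketched as follows, assuming NLTK for sentence splitting and stop-word removal and gensim for Word2Vec. The 32-dimensional vector size matches the text; the window size, minimum count, and the mean-pooling step that turns word vectors into one fixed-length essay vector are our assumptions.

```python
# Sketch of the preprocessing pipeline: sentence-split each essay,
# drop NLTK stop words, train Word2Vec (32 dimensions as in the text),
# then average word vectors into one fixed-length vector per essay.
import numpy as np
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec

nltk.download("punkt")
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def essay_to_wordlists(essay):
    """One list of lower-cased content words per sentence of the essay."""
    return [[w.lower() for w in nltk.word_tokenize(sent)
             if w.isalpha() and w.lower() not in stop_words]
            for sent in nltk.sent_tokenize(essay)]

# `essays` is the DataFrame from the loading sketch above.
corpus = [words for e in essays["essay"] for words in essay_to_wordlists(e)]
w2v = Word2Vec(corpus, vector_size=32, window=5, min_count=2, workers=4)

def essay_vector(essay, dim=32):
    """Mean of the Word2Vec vectors of all in-vocabulary essay words
    (mean pooling is our assumption, not spelled out in the text)."""
    words = [w for sent in essay_to_wordlists(essay) for w in sent
             if w in w2v.wv]
    return np.mean([w2v.wv[w] for w in words], axis=0) if words else np.zeros(dim)
```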



Table 4.2 K-fold cross-validation QWK scores

K-fold    QWK score
Fold-1    0.8489
Fold-2    0.8576
Fold-3    0.8628
Fold-4    0.8451
Fold-5    0.8529
Average   0.8535

We developed an LSTM [18] with two layers to train on the essay vectors. The LSTM [18] is a deep learning network well suited to sequence-to-sequence learning; in an essay, all the sentences are connected to each other, so essay scoring needs sequence-to-sequence learning to predict the score. We configured the LSTM [18] with preset hyperparameters: a learning rate of 0.1 and a batch size of 64. We used a recurrent dropout rate of 0.4 and ReLU as the activation function, calculated the loss with mean squared error, and used RMSprop as the optimizer.
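A minimal Keras sketch of such a network is shown below. The recurrent dropout of 0.4, ReLU output activation, MSE loss, RMSprop optimizer, and 0.1 learning rate follow the text; the layer widths are assumptions.

```python
# Two-layer LSTM as described: recurrent dropout 0.4, ReLU activation
# on the score output, MSE loss, RMSprop with learning rate 0.1.
# Layer sizes (64 and 32 units) are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import RMSprop

def build_model(timesteps=1, features=32):
    model = Sequential([
        LSTM(64, recurrent_dropout=0.4, return_sequences=True,
             input_shape=(timesteps, features)),
        LSTM(32, recurrent_dropout=0.4),
        Dense(1, activation="relu"),  # single non-negative score
    ])
    model.compile(loss="mean_squared_error",
                  optimizer=RMSprop(learning_rate=0.1))
    return model
```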
To train the LSTM [18] with the feature vectors, we first divided the total essay set into five groups (folds), each containing a subset of the essays' feature vectors. Of the five folds [19, 20], four are used for training and the remaining one for testing. This process repeats until every fold has served for both testing and training: each time, one fold is held out for testing and the rest are used for training. Each fold is trained for six iterations on the LSTM model [18].
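The fold loop can be sketched with scikit-learn's KFold as below. Here `X` stacks the per-essay vectors from the earlier sketch reshaped to (n_essays, 1, 32) and `y` holds the human scores; both are assumptions building on the previous snippets, and any per-prompt score normalization is left out.

```python
# Five-fold training loop: four folds train, one fold tests, six
# epochs per fold as described in the text.
import numpy as np
from sklearn.model_selection import KFold

X = np.stack([essay_vector(e) for e in essays["essay"]])[:, np.newaxis, :]
y = essays["domain1_score"].to_numpy(dtype=float)

predictions = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    model = build_model(timesteps=1, features=32)
    model.fit(X[train_idx], y[train_idx], batch_size=64, epochs=6, verbose=0)
    preds = model.predict(X[test_idx]).ravel()
    predictions.append((test_idx, preds))
```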
Quadratic weighted Kappa (QWK) is used to compare the predicted scores with the actual scores, that is, to measure the agreement between the two raters: the human rater and the system. The final average QWK score after all iterations is 0.8535.
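One way to compute QWK is scikit-learn's cohen_kappa_score with quadratic weights, as sketched below; rounding predictions to integer labels first is our assumption, since the metric expects discrete ratings.

```python
# QWK between human and system scores; predictions are rounded to
# integer labels before scoring (an assumption).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(human, predicted):
    return cohen_kappa_score(np.asarray(human, dtype=int),
                             np.rint(predicted).astype(int),
                             weights="quadratic")

# Building on the fold loop above: score each held-out fold, then average.
fold_qwks = [qwk(y[idx], preds) for idx, preds in predictions]
print(f"mean QWK: {np.mean(fold_qwks):.4f}")
```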

4.4 Results and Analysis

The results of the fivefold cross-validation [19, 20], with the QWK score of each fold, are shown in Table 4.2. The mean QWK score achieved by our AES system is 0.8535. Our LSTM with K-fold cross-validation outperformed other models, such as those of Tirthankar Dasgupta et al. and Yaman Kumar et al., which applied various neural network architectures such as CNN and LSTM to AES on the Kaggle dataset between 2018 and 2020. A comparison of our model with these models is shown in Table 4.3.

4.5 Conclusion and Future Work

In this work, we proposed an LSTM with K-fold cross-validation, with features retrieved via Word2Vec, for scoring essays. Experiments on the publicly available ASAP dataset showed that our model outperforms current approaches.

Table 4.3 Comparison of results

System | Approach | Dataset | Features applied | Results
Tirthankar Dasgupta et al. [11] | CNN-bidirectional LSTM neural network | ASAP Kaggle | Content and psychological features | QWK 0.786
Wang et al. [12] | Bi-LSTM | ASAP Kaggle | Word embedding sequence | QWK 0.724
Yaman Kumar et al. [13] | Random Forest, CNN, RNN neural networks | ASAP Kaggle short answers | Style and content-based features | QWK 82%
Jiawei Liu et al. [14] | CNN, LSTM, BERT | ASAP Kaggle | Semantic data; handcrafted features like grammar correction, essay length, number of sentences, etc. | QWK 0.709
Darwish et al. [15] | Multiple linear regression | ASAP Kaggle | Style and content-based features | QWK 0.77
Zhu and Sun [16] | RNN (LSTM, Bi-LSTM) | ASAP Kaggle | Word embedding, grammar count, word count | QWK 0.70
Uto and Okano [17] | Item response theory models (CNN-LSTM, BERT) | ASAP Kaggle | - | QWK 0.749
Proposed model | LSTM with multiple layers | ASAP Kaggle | Word2Vec, sentence vector, BoW | QWK 0.8535

In terms of the agreement between the machine learning model and human graders, our model improves on the other models by 5–7%, achieving a mean QWK score of 0.8535.
Although our model improved accuracy, the features we extracted with Word2Vec are word-level features, so our model may miss the semantics of the essay, which is an essential consideration in essay grading. In future work we want to address this and assess essays by considering features such as cohesion, coherence, and completeness.

References

1. Dadi, R. et al.: IOP Conf. Ser.: Mater. Sci. Eng. 981, 022016 (2020)
2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words

and phrases and their compositionality. In: Proceedings of the 26th International Conference on
Neural Information Processing Systems, vol. 2, pp. 3111–3119 (NIPS’13). Curran Associates
Inc., Red Hook, NY, USA (2013)
3. Sheshikala, M. et al.: Natural language processing and machine learning classifier used for
detecting the author of the sentence. Int. J. Recent Technol. Eng. (IJRTE) (2019)
4. Dolamic, L., Savoy, J.: When stopword lists make the difference. J. Am. Soc. Inf. Sci. Technol.
61(1), 200–203 (2010)
5. Mikolov, T. et al.: Efficient Estimation of Word Representations in Vector Space. ICLR (2013)
6. Ramesh, D.: Enhancements of artificial intelligence and machine learning. Int. J. Adv. Sci.
Technol. 28(17), 16–23 (2019). Accessed from http://sersc.org/journals/index.php/ijast/article/
view/2223
7. Al, S.M. et al.: A comprehensive study on traditional AI and ANN architecture. Int. J. Adv. Sci. Technol. 28(17), 479–487 (2019)
8. Al, S.N.P. et al.: Variation analysis of artificial intelligence, machine learning and advantages of deep architectures. Int. J. Adv. Sci. Technol. 28(17), 488–495 (2019)
9. Ajay, H.B., Tillett, P.I., Page, E.B.: Analysis of essays by computer (AEC-II) (No. 8–0102).
Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education,
National Center for Educational Research and Development (1973)
10. Foltz, P.W., Laham, D., Landauer, T.K.: The intelligent essay assessor: applications to educa-
tional technology. Interact. Multimed. Electron. J. Comput.-Enhanc. Learn. 1(2) (1999). http://imej.wfu.edu/articles/1999/2/04/index.asp
11. Dasgupta, T., Naskar, A., Dey, L., Saha, R.: Augmenting textual qualitative features in deep
convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th
Workshop on Natural Language Processing Techniques for Educational Applications, pp. 93–
102 (2018)
12. Wang, Y. et al.: Automatic essay scoring incorporating rating schema via reinforcement
learning. EMNLP (2018)
13. Kumar, Y., Aggarwal, S., Mahata, D., Shah, R.R., Kumaraguru, P., Zimmermann, R.: Get it
scored using AutoSAS: an automated system for scoring short answers. AAAI (2019)
14. Liu, J., Xu, Y., Zhao, L.: Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744 (2019)
15. Darwish S.M., Mohamed S.K.: Automated essay evaluation based on fusion of fuzzy ontology
and latent semantic analysis. In: Hassanien, A., Azar, A., Gaber, T., Bhatnagar, R.F., Tolba,
M. (eds.) The International Conference on Advanced Machine Learning Technologies and
Applications (2020)
16. Zhu W., Sun Y.: Automated essay scoring system using multi-model machine learning. In:
Wyld D.C. et al. (eds.) MLNLP, BDIOT, ITCCMA, CSITY, DTMN, AIFZ, SIGPRO (2020)
17. Uto M., Okano M.: Robust neural automated essay scoring using item response theory. In:
Bittencourt, I., Cukurova, M., Muldner, K., Luckin, R., Millán, E. (eds.) Artificial Intelligence
in Education. AIED 2020. Lecture Notes in Computer Science, vol. 12163. Springer, Cham
(2020)
18. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997). https://doi.org/10.1162/neco.1997.9.8.1735
19. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-validation. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia
of Database Systems. Springer, Boston, MA (2009). https://doi.org/10.1007/978-0-387-39940-
9_565
20. Yadav, S., Shukla, S.: Analysis of k-fold cross-validation over hold-out validation on colossal
datasets for quality classification. In: 2016 IEEE 6th International Conference on Advanced
Computing (IACC), pp. 78–83. Bhimavaram, India (2016). https://doi.org/10.1109/iacc.2016.25
