Experimenting with Latent Semantic Analysis and Latent Dirichlet Allocation on Automated Essay Grading

2020 Seventh International Conference on Social Networks Analysis, Management and Security (SNAMS) | 978-0-7381-1180-3/20/$31.00 ©2020 IEEE | DOI: 10.1109/SNAMS52053.2020.9336533

Jalaa Hoblos
Penn State University, Erie
Erie, PA, USA
jxh83@psu.edu

Abstract—The demand for scoring natural language responses has created a need for new computational tools that can be applied to automatically grade student essays. Systems for automatic essay assessment have been commercially available since the 1990's. However, progress in the field was obstructed by a lack of qualitative information regarding the effectiveness of such systems. Most of the research in automatic essay grading has been associated with English writing due to its widespread use and the availability of more learner collections and language processing software for the language. In addition, there is a large number of commercial software packages for grading programming assignments automatically. In this work, we investigate document semantic similarity based on Latent Semantic Analysis (LSA) and on Latent Dirichlet Allocation (LDA). We use an open-source Python library, Gensim, to develop and implement an essay grading system able to compare an essay to an answer key and assign it a grade based on the semantic similarity between the two. We test our tool on variable-size essays and conduct experiments to compare the results obtained from a human grader (professor) with those obtained from the automatic grading system. Results show high correlation between the professor's grades and the grades assigned by both modeling techniques. However, LSA-based modeling showed more promising results than the LDA-based method.

Index Terms—Machine Learning, Latent Semantic Analysis, Latent Dirichlet Allocation, Semantic Similarity

I. INTRODUCTION

Automatic grading of programming assignments is available and popular. Multiple systems currently help graders grade programming projects more easily and faster [1]–[6]. Automated essay grading is an important machine learning application. Most of these systems are proprietary and evaluate features of the text of each essay, such as the total number of words, the number of subsidiary clauses, or the ratio of uppercase to lowercase letters. The tendency in earlier eras was more towards rule-based methods, which either grade answers in parts with concept-mapping techniques or holistically with information-extraction techniques [7]–[9]. Later, the trend shifted more towards statistical methods, whereby the features are generated with the assistance of corpus-based methods, or Natural Language Processing (NLP) methods used as part of a machine learning system [10]–[12]. However, automatic essay grading is still evolving, and work on prediction rules that generate an accurate grade is still ongoing.

In this paper, we introduce an essay grading system capable of comparing a student's essay to the professor's answer key and assigning an appropriate grade based on the semantic similarity between the two. To predict similarity we use two modeling approaches: Latent Semantic Analysis (LSA) [12], [13] and Latent Dirichlet Allocation (LDA) [14]. An essential problem in data assessment is overfitting, i.e. using a small data sample to make an accurate prediction [11], [15]. The grading software must compare essays, understand what parts make sense and what do not, and then make an informed decision which constitutes the grade. We use the Gensim [16] software to develop our system. Gensim is an open-source Python library designed to process raw data and extract semantic topics. The core concepts in Gensim are: Document, Corpus, Vector Space Model (VSM) [17], Model and Transformation. We briefly explain each concept, but we refer the reader to [16] for a more elaborate explanation. A token is a word, and a document is an object of the text sequence; it can be as short as one sentence or as big as a book. The dictionary is created from the list of sequences from one or more text files. It contains the mapping of all tokens to their unique ids and is used to create a bag-of-words corpus. In the corpus, each document is represented by a vector holding the frequency count of each word in the dictionary, omitting all entries with value 0 to save space. Thus, a corpus is a collection of digital documents; it serves as input for model training and, once a model is trained, for extracting topics from new documents, i.e. documents that are not included in the training corpus. To be able to infer the latent structure in a corpus, each document is represented as a vector of features that can be manipulated mathematically. Term Frequency-inverse Document Frequency (TFiDF) [18], [19] is also a bag-of-words model, but unlike the regular corpus it down-weights tokens that appear frequently across documents. Last, in Gensim the code and its data are referred to as a model. A model is required to transform one document representation (vector space) into another. By training on the corpus, the parameters of the transformation are learned. Gensim implements various

Authorized licensed use limited to: BOURNEMOUTH UNIVERSITY. Downloaded on July 01,2021 at 01:30:45 UTC from IEEE Xplore. Restrictions apply.
VSM algorithms, but we use two of them: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). The power of Gensim comes from the fact that a corpus does not have to reside in RAM if it is very large; it is only necessary that the corpus be able to return one document vector at a time. Gensim accepts any object that, when iterated over, successively yields documents. This paper is organized as follows: in Section II, we briefly explain LSA, LDA and TFiDF. In Section III, we present some related work. The data processing and semantic similarity algorithms are explained in Section IV. Sections V and VI cover the experiment setups and the obtained results. Last, Section VII discusses some open issues and future work.

II. TOPIC MODELING ALGORITHMS

Topic modeling is a technique used to extract abstract topics from a collection of documents. In this section, we summarize the three topic modeling methods we use in this work.

A. Latent Semantic Analysis

LSA is a computational text analysis tool that constructs a semantic space from a corpus of text, which in turn is used to compute the similarity between words, sentences, and entire documents. LSA uses an advanced matrix algebra method called Singular Value Decomposition (SVD) [20] to factorize matrices. The semantic space is a high-dimensional vector space that is not too useful to humans, thus additional methods are needed to create an examinable structure. The results of LSA can be used as input for further algorithmic processing to understand the similarity values in a different way.

Formally, LSA works as follows. Suppose A is an m × n term-document matrix of a collection of documents, where each column of A is a document. The dimensions m and n represent the number of words and documents respectively. If term k occurs n times in document l then A[k, l] = n. Note that

    B = A^T A    (1)

is the n × n document-document matrix and

    C = A A^T    (2)

is the m × m term-term matrix. Obviously, both matrices B and C are square and symmetric. Next, the Singular Value Decomposition (SVD) is performed on A using B and C as follows:

    A = S Σ U^T    (3)

where S is the matrix of eigenvectors of C, U is the matrix of eigenvectors of B, and Σ is the diagonal matrix of the singular values, obtained as the square roots of the eigenvalues λ_i of B, i.e.

    σ_i = √λ_i    (4)

and

    Σ = diag(σ_1, ..., σ_r)    (5)

The σ_i are the singular values of A with rank r and are listed in decreasing order as follows:

    σ_1 ≥ σ_2 ≥ σ_3 ≥ ... ≥ σ_r ≥ 0    (6)

Note that some of the singular values are too small, thus they are ignored and replaced by 0. Last, to reduce dimensionality, LSA uses the truncated SVD: instead of taking all the singular values and their singular vectors, LSA takes the k largest singular values and their corresponding vectors. To compute the truncated SVD, we calculate the inverse of the matrix A factorized in Equation 3:

    A^{-1} = U t S^T    (7)

where

    t_ii ≡ 1 / Σ_ii    (8)

These values are the largest non-negative elements, and all other elements of t are 0. Using the truncated SVD, the underlying latent structure is represented in a reduced k-dimensional space.

In the context of information retrieval, LSA is called Latent Semantic Indexing (LSI).¹ LSI computes how frequently words occur in the corpus and assumes that similar documents contain approximately the same distribution of word frequencies for certain words, where each document is treated as a bag-of-words. The most used method for computing word frequencies is the TFiDF method. TFiDF produces vectors of the comparable texts, which contain one element per word in the vocabulary. These vectors are then reduced to a number of topics with LSI.

B. Latent Dirichlet Allocation

LDA is a topic modeling algorithm based on a statistical method that analyzes the words of texts to discover the patterns that run through them and how they are connected. LDA is a probabilistic approach where documents are represented as random mixtures over latent topics, and each topic is defined by a distribution over words. Multiple variations of LDA have been developed over the years. Supervised LDA methods enable the user to specify some topics, and the corpus analysis is seeded to include these seeded topics in its overall probabilistic model. The main difference between LSA and LDA is that LDA assumes that the distribution of topics in a document and the distribution of words in topics are Dirichlet distributions [21]; LSA does not assume any distribution.

The Dirichlet distribution is a family of continuous multivariate probability distributions over discrete probability distributions on k categories x = {x_1, x_2, ..., x_k}, where 0 ≤ x_i ≤ 1 for i ∈ [1, k] and Σ_{i=1}^{k} x_i = 1. It is parameterized by k parameters α = {α_1, α_2, ..., α_k}, where α_i > 0 and k ≥ 2. Formally,

    Dir(α) = f(x; α) = (1 / B(α)) Π_{i=1}^{k} x_i^{α_i − 1}    (9)

¹ We will be using LSI and LSA interchangeably in this work.

where B(α) is the multivariate beta function [22], given by:

    B(α) = Π_{i=1}^{k} Γ(α_i) / Γ(Σ_{i=1}^{k} α_i)    (10)

Γ is the gamma function [23], a generalization of the factorial function to non-integral values. To extend the factorial to any real number x > 0, Γ is defined as:

    Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt    (11)

The gamma function has the following property:

    Γ(x + 1) = x Γ(x)    (12)

C. Term Frequency-Inverse Document Frequency

TFiDF is a statistical quantifier that evaluates how relevant a word is to a document in a collection of documents. Each word is assigned a weight which signifies its importance in the document; the weight reflects how relevant the word is throughout the corpus. TF (Term Frequency) is the number of times a word occurs in a document. The iDF (inverse Document Frequency) of a word is a measure of how significant the word is in the entire corpus. TFiDF is shown in Equation 13:

    W_{i,j} = tf_{ij} × log2(D / df_i)    (13)

where tf_{ij} is the number of occurrences of term i in document j, df_i is the number of documents containing i, and D is the total number of documents in the corpus.

III. RELATED WORK

LSA has been widely used to compare semantic similarities, as shown in [11]. In that paper, the authors summarized various LSA-based techniques capable of making comparisons between instructional sources and expository student writing. Most of these techniques are based on the pedagogical aspect of the system.

In [10], the authors used an LSA-based interactive tutor called Select-a-Kibitzer. The system gives students feedback on their essays in a unique way: it features an array of human-like agents, or kibitzers, where each kibitzer behaves as a critic for a specific attribute, such as style, grammar or semantics.

Foltz et al. [15] described an automated essay grader and critic using LSA. The system maps student work to a body of data and identifies a wide range of acceptable semantic possibilities in the student responses. Once the student work is compared to the data, the grader can return a grade or comments.

In [33], the authors proposed an automatic short answer grading and feedback system based on clustering with word frequencies. However, the system did not detect synonyms of the words in the model answer, thus causing grading discrepancies between the human graders and the system.

The authors in [34] proposed a software system that provides a new approach to automated marking based on the Stanford Parser [35]. The system is limited to short-answer questions only.

The authors in [36] introduced multiple LDA methods that can be applied to the question answering problem. They showed that estimating the similarity between a query and essays helps improve retrieval performance by using topic structures.

The work in [37] introduced a new statistical model for the question answering problem in community archives which can be used for tagging questions and answers. It uses topic models to deduce question and answer representations in topic space and retrieves based on these representations.

In [38], the authors investigated two types of similarity measures based on LDA, one word-to-word and the other sentence-level. The LDA-based solutions were compared to LSA-based methods. Their results showed that the word-to-word LDA-based measure outperforms LSA when combined with the optimal matching method, but this was not the case at the sentence level.

The authors in [39] showed that a method to validate the LDA algorithm combined with cosine similarity works differently depending on the quality of the documents.

The work in [40] presented two different approaches to computing semantic similarity between two documents based on LSA. The results showed that for sentence-level texts, combining LSA with either iDF or binary weighting worked best; for paragraph-level texts, a log-type local weighting in combination with binary global weighting works best. They concluded that global weights have a greater impact on sentence-level similarity.

In [41], the authors defined several semantic similarity measures based on the topic and word distributions in LDA. However, they modeled the problem of semantic similarity as a binary decision problem, where a student response can be either correct or false.

IV. DATA PROCESSING AND SEMANTIC SIMILARITY ALGORITHM

How to compute semantic similarity between documents is somewhat vague in the Gensim documentation, thus we summarize it as the steps shown in Algorithm 1.

Algorithm 1 Semantic Similarity Algorithm
1: Create the dictionary as shown in Algorithm 2
2: Acquire the number of features based on the dictionary
3: Acquire the corpus based on the dictionary
4: Use the TFiDF model to process the corpus and obtain an index
5: Transform documents from the TFiDF-weighted space into a latent space of lower dimensionality
6: Convert the answer key to LSI/LDA space
7: Transform the corpus to LSI/LDA space and index it
8: Perform a similarity query against the corpus (for similarity between vectors, Gensim uses the Cosine Similarity [24])

Since data is the core of any machine learning algorithm, and for it to be useful and better understood, we need first to

prepare it for machine learning. There are multiple ways to
make data ready for machine learning algorithms. Below is
the algorithm we use to process the documents and create the
dictionary:

Algorithm 2 Creating a Dictionary

1: Convert all documents to lower-case
2: Strip punctuation from all words
3: Form bigrams [25] of words for processing
4: Use lemmatization [26] to resolve words to their canonical form
5: Remove all stop words, digits, and words with length less than three characters
6: Filter out words that occur only once
Fig. 1. Manual vs. Automated Grades Using Given Keywords (LSI)
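The cleaning steps of Algorithm 2 can be sketched with the standard library alone. This is an illustration, not the system's code: steps 3 and 4 (bigram formation and lemmatization, which Gensim's Phrases and an external lemmatizer handle in practice) are omitted, and the stop-word list is a tiny stand-in for a full one.

```python
# Sketch of Algorithm 2 using only the standard library.
# Steps 3-4 (bigrams, lemmatization) are omitted; STOP_WORDS is illustrative.
import string
from collections import Counter

STOP_WORDS = {"the", "and", "are", "for", "that", "this", "with"}

def build_vocabulary(documents):
    """Apply the cleaning steps of Algorithm 2; return the kept tokens per document."""
    tokenized = []
    for doc in documents:
        # Steps 1-2: lower-case, then strip punctuation
        doc = doc.lower().translate(str.maketrans("", "", string.punctuation))
        # Step 5: drop stop words, digits, and words shorter than three characters
        tokens = [t for t in doc.split()
                  if t not in STOP_WORDS and not t.isdigit() and len(t) >= 3]
        tokenized.append(tokens)
    # Step 6: filter out words that occur only once in the whole corpus
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] > 1] for doc in tokenized]

docs = ["Scope and binding rules!", "Static scope, dynamic scope, and binding."]
print(build_vocabulary(docs))  # [['scope', 'binding'], ['scope', 'scope', 'binding']]
```

The surviving tokens are exactly what a Gensim Dictionary would then map to unique ids.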

We train each model using six large books about Programming Language Concepts/Principles [27]–[32] in addition to all student essays (118 essays). Some of the essays are more than two pages long and some are as short as a couple of sentences. The corpus is used to train a machine learning model, and the models use the corpus to initialize their parameters.
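Algorithm 1's pipeline can be sketched end to end with plain numpy: bag-of-words vectors, TFiDF weighting per Equation 13, a truncated SVD in place of the LSI transformation, and a cosine-similarity query against the answer key. The real system uses Gensim's TfidfModel, LsiModel, and similarity index instead; the three-document corpus and answer key below are invented for illustration.

```python
# Algorithm 1 in miniature with plain numpy: bag-of-words -> TFiDF
# (Equation 13) -> truncated SVD (the LSI step) -> cosine similarity
# against the answer key. The toy corpus and answer key are invented.
import numpy as np

corpus = [["scope", "binding", "static"],
          ["scope", "dynamic", "binding"],
          ["parsing", "grammar", "tokens"]]
answer_key = ["scope", "binding"]

vocab = sorted({t for doc in corpus for t in doc})
df = np.array([sum(t in doc for doc in corpus) for t in vocab], dtype=float)

def bow(tokens):
    return np.array([tokens.count(t) for t in vocab], dtype=float)

def tfidf(vec):
    return vec * np.log2(len(corpus) / df)  # w_ij = tf_ij * log2(D / df_i)

A = np.stack([tfidf(bow(d)) for d in corpus])        # documents x terms

# Truncated SVD: keep only the k largest singular values and their vectors
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_latent = U[:, :k] * s[:k]                       # documents in latent space
key_latent = tfidf(bow(answer_key)) @ Vt[:k].T       # fold the answer key in

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = [cosine(d, key_latent) for d in docs_latent]  # one score per essay
```

The two documents sharing vocabulary with the answer key score near 1 and the unrelated one near 0; in the actual grading system such scores are turned into holistic grades.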
V. EXPERIMENTS SETUP

The tool is used in our Programming Language Concepts class, which is an undergraduate senior class. We ask students to summarize Section 10.4 of Sebesta's book [31]. The professor provides an expert "answer key" against which all essays are compared. We test the tool in two experiments. In the first experiment, we give the students a set of keywords and ask them to create the summary using these keywords. In the second experiment, we ask them to summarize the section without giving them any keywords.

Fig. 2. Correlation between Manual vs. Automated Grades when using LSI With Keywords

All essays are graded manually by the professor, and we run them all through the grading system, where each one is given a holistic grade. We then compute the correlation between the grades assigned by the professor and those returned by the system. Higher correlation numbers mean higher accuracy of our grading system.

a) Summary with keywords: This experiment was conducted over two semesters, fall of 2018 and summer of 2019. The total number of students who participated in both sections is 57. The fall class was resident, but the summer class was online.

b) Summary without keywords: We repeat the experiment, but this time without giving the students any keywords. Students are asked to read the section and summarize it without any guidelines. This experiment was conducted over two resident sections in fall of 2019. The number of students who participated in the experiment is 61.

VI. EXPERIMENTS RESULTS

In this section, we summarize the results of the two experiments by running the grading system using the LSI and LDA modeling. Below are the experiments that we run to test our grading system:

1) Latent Semantic Indexing
   • With keywords
   • Without keywords
2) Latent Dirichlet Allocation
   • With keywords
   • Without keywords
3) Last, we compute the correlation between all grades of each modeling method, i.e. the grades (with and without keywords) of all 118 students under each modeling technique.

A. Latent Semantic Indexing

Since LSI is based on linear algebra, the results of the simulation come out the same each time we run it on the same data. Here are the results:

a) Summary with keywords: The grades returned by the system when using LSI and the grades assigned by the professor (manual) are shown in Figure 1. The correlation between the professor's grades and the automated grades is 84.05%, and the coefficient of determination R² is 70.65%, as indicated in Figure 2.

b) Summary without keywords: Manual and automated grades are shown in Figure 3. The correlation between both grades is 90.9% and R² is 82.62%, as shown in Figure 4.
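The agreement metrics reported in this section are the Pearson correlation between the two graders and its square; a minimal sketch, using invented grade vectors rather than the study's data:

```python
# Reproducing the agreement metrics: Pearson correlation between manual
# and automated grades, and R^2 as its square. The grade vectors are invented.
import numpy as np

manual = np.array([88.0, 92.0, 75.0, 60.0, 95.0, 81.0])     # professor (hypothetical)
automated = np.array([85.0, 90.0, 70.0, 66.0, 97.0, 78.0])  # system (hypothetical)

r = np.corrcoef(manual, automated)[0, 1]  # Pearson correlation
r_squared = r ** 2                        # coefficient of determination

print(f"correlation = {100 * r:.2f}%, R^2 = {100 * r_squared:.2f}%")
```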
Fig. 3. Manual vs. Automated Grades Without Given Keywords (LSI)

Fig. 4. Correlation between Manual vs. Automated Grades when using LSI without Keywords

Fig. 6. Manual vs. Automated Grades for all Sections and Both Experiments (LSI)

Fig. 7. Manual vs. Automated Grades Using Given Keywords (LDA)

c) Comparison of both experiments: We plotted the grades obtained by the professor and those obtained by the system for both experiments, as shown in Figure 6. The correlation between both grades across both experiments is 87.9% and R² is 77.33%, as shown in Figure 5.

B. Latent Dirichlet Allocation

Since LDA is a generative probabilistic model, it uses randomness in both the training and inference steps. To stabilize the topic generation, we reset the random seed to the same value every time the model is trained or inference is performed.
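This reseeding corresponds to fixing the generator behind LDA's Dirichlet draws (in Gensim, the random_state argument of LdaModel). A numpy illustration of why a fixed seed makes the draws, and hence the discovered topics, repeatable; the three-topic prior is illustrative only:

```python
# LDA training and inference draw from Dirichlet priors; fixing the seed
# (random_state in Gensim's LdaModel) makes those draws repeatable.
# The 3-topic symmetric prior below is illustrative.
import numpy as np

alpha = [0.5, 0.5, 0.5]  # symmetric Dirichlet prior over three topics

def topic_mixture(seed):
    rng = np.random.default_rng(seed)
    return rng.dirichlet(alpha)

run_a, run_b = topic_mixture(42), topic_mixture(42)
assert np.array_equal(run_a, run_b)   # same seed -> identical topic mixture
assert abs(run_a.sum() - 1.0) < 1e-9  # a Dirichlet draw is a probability vector
```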
a) Summary with keywords: The grades returned by the system when using LDA and the grades assigned by the professor (manual) are shown in Figure 7. The correlation between the professor's grades and the automated grades is 73.42% and R² is 53.91%, as indicated in Figure 8.

b) Summary without keywords: Manual and automated grades are shown in Figure 9. The correlation between both grades is 77.55% and R² is 60.14%, as shown in Figure 10.

c) Comparison of both experiments: We plotted the grades obtained by the professor and those obtained by the system for both experiments, as shown in Figure 11. The correlation obtained is ≈ 76.8% and R² is 58.97%, as shown in Figure 12.
Fig. 5. Correlation between Manual vs. Automated Grades when using LSI W/O Keywords

VII. DISCUSSION AND FUTURE WORK

In this work, we develop and test an automated essay grading system based on the LSA and LDA modeling techniques.

Fig. 8. Manual vs. Automated Grades for all Sections and Both Experiments (LDA)

Fig. 9. Manual vs. Automated Grades Without Given Keywords (LDA)

Fig. 10. Correlation between Manual vs. Automated Grades when using LDA Without Keywords

Fig. 11. Manual vs. Automated Grades for all Sections and Both Experiments

Fig. 12. Correlation between Manual vs. Automated Grades for all Sections and Both Experiments (LDA)

The system works on long and short essays. Its purpose is to help instructors grade large numbers of essays in a shorter time. The tool shows a relatively high correlation between the grades it returns and the professor-assigned grades. Preliminary results yield better accuracy when using the LSI-based technique. However, we should emphasize that at this point of the research, with only 118 essays, it is premature to adopt the tool as a substitute for a human grader. In addition, the corpus we train on is not very large. Hence, much more testing is needed on more samples and on a larger corpus. For future work, we will continue testing the system on more data and a larger corpus. In the upcoming academic year, we are offering the course in question in both semesters (two sections in the fall and one in the spring), thus collecting more samples. This will help us draw better conclusions on the efficiency of the system. We are also planning to test the system with other modeling transformations, such as Random Projections [42] and the Hierarchical Dirichlet Process [43], to compare which modeling techniques give more accurate results.

REFERENCES

[1] "The future of automated grading is here." [Online]. Available: https://autogradr.com/

[2] "Stepik is an educational platform for computer science." [Online]. Available: https://stepik.org/
[3] "Your online Teaching Assistant INGInious." [Online]. Available: https://inginious.org/
[4] "Autolab Project." [Online]. Available: https://github.com/autolab
[5] "Automatically grading programming homework." [Online]. Available: http://news.mit.edu/2013/automatically-grading-programming-homework-0603
[6] S. Srikant and V. Aggarwal, "A system to grade computer programming skills using machine learning," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '14. New York, NY, USA: ACM, 2014, pp. 1887–1896. [Online]. Available: http://doi.acm.org/10.1145/2623330.2623377
[7] "About the e-rater Scoring Engine." [Online]. Available: https://www.ets.org/erater/about
[8] "Intellimetric." [Online]. Available: http://www.intellimetric.com/direct/
[9] "Automated essay scoring, Measurement Incorporated." [Online]. Available: http://www.measurementinc.com/products-services/automated-essay-scoring
[10] P. Wiemer-Hastings and A. C. Graesser, "Select-a-Kibitzer: A computer tool that gives meaningful feedback on student compositions," Interactive Learning Environments, vol. 8, no. 2, pp. 149–169, 2000. [Online]. Available: https://doi.org/10.1076/1049-4820(200008)8:2;1-B;FT149
[11] T. K. Landauer and J. Psotka, "Simulating text understanding for educational applications with latent semantic analysis: Introduction to LSA," Interactive Learning Environments, vol. 8, no. 2, pp. 73–86, 2000. [Online]. Available: https://doi.org/10.1076/1049-4820(200008)8:2;1-B;FT073
[12] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse Processes, vol. 25, no. 2-3, pp. 259–284, 1998. [Online]. Available: https://doi.org/10.1080/01638539809545028
[13] T. K. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch, Handbook of Latent Semantic Analysis. Psychology Press, 2013.
[14] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[15] P. W. Foltz, S. Gilliam, and S. Kendall, "Supporting content-based feedback in on-line writing evaluation with LSA," Interactive Learning Environments, vol. 8, no. 2, pp. 111–127, 2000. [Online]. Available: https://doi.org/10.1076/1049-4820(200008)8:2;1-B;FT111
[16] "gensim: topic modeling for humans." [Online]. Available: https://radimrehurek.com/gensim/
[17] M. Melucci, Vector Space Model, L. Liu and M. T. Özsu, Eds. Boston, MA: Springer US, 2009.
[18] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.
[19] H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information," IBM J. Res. Dev., vol. 1, no. 4, pp. 309–317, Oct. 1957.
[20] G. H. Golub and C. Reinsch, "Singular value decomposition and least squares solutions," Numer. Math., vol. 14, no. 5, pp. 403–420, Apr. 1970. [Online]. Available: http://dx.doi.org/10.1007/BF02163027
[21] T. P. Minka, "Estimating a Dirichlet distribution," Tech. Rep., 2000.
[22] D. Riddhi, "Beta function and its applications," The University of Tennessee, Knoxville, USA, 2008.
[23] E. Artin, The Gamma Function. Courier Dover Publications, 2015.
[24] J. Han, M. Kamber, and J. Pei, "Getting to Know Your Data," in Data Mining, 3rd ed., ser. The Morgan Kaufmann Series in Data Management Systems. Boston: Morgan Kaufmann, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123814791000022
[25] S. Kapadia, "Language Models: N-Gram," Aug. 2019. [Online]. Available: https://towardsdatascience.com/introduction-to-language-models-n-gram-e323081503d9
[26] "Stemming and lemmatization." [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
[27] C. Ghezzi and M. Jazayeri, Programming Language Concepts, 2nd ed. USA: John Wiley and Sons, Inc., 1986.
[28] M. Balaban, Principles of Programming Languages, May 2017. [Online]. Available: https://www.cs.bgu.ac.il/~mira/ppl-book-full.pdf
[29] E. Boiten, "Concepts in programming languages, by John C. Mitchell, Cambridge University Press, 2002, ISBN 0-521-78098-5," Journal of Functional Programming, vol. 13, no. 6, pp. 1087–1088, 2003.
[30] S. Krishnamurthi, Programming Languages: Application and Interpretation, 2007.
[31] R. W. Sebesta, Concepts of Programming Languages, 10th ed. Pearson.
[32] T. Pratt and M. V. Zelkowitz, Programming Languages: Design and Implementation, 4th ed. USA: Prentice-Hall, Inc., 2001.
[33] N. Suzen, A. N. Gorban, J. Levesley, and E. M. Mirkes, "Automatic short answer grading and feedback using text mining methods," ArXiv, vol. abs/1807.10543, 2018.
[34] C. J. Harrison, R. Siddiqi, and R. Siddiqi, "Improving teaching and learning through automated short-answer marking," IEEE Transactions on Learning Technologies, vol. 3, no. 3, pp. 237–249, Jul. 2010.
[35] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60. [Online]. Available: http://www.aclweb.org/anthology/P/P14/P14-5010
[36] A. Celikyilmaz, D. Hakkani-Tur, and G. Tur, "LDA based similarity modeling for question answering," in Proceedings of the NAACL HLT 2010 Workshop on Semantic Search. Los Angeles, California: Association for Computational Linguistics, Jun. 2010, pp. 1–9. [Online]. Available: https://www.aclweb.org/anthology/W10-1201
[37] Z. Zolaktaf, F. Riahi, M. Shafiei, and E. Milios, "Modeling community question-answering archives," May 2012.
[38] V. Rus, N. B. Niraula, and R. Banjade, "Similarity measures based on latent Dirichlet allocation," vol. 7816. Springer, 2013, pp. 459–470. [Online]. Available: http://dblp.uni-trier.de/db/conf/cicling/cicling2013-1.html#RusNB13
[39] A. C. Graesser, K. VanLehn, C. P. Rosé, P. W. Jordan, and D. Harter, "Intelligent tutoring systems with conversational dialogue," AI Mag., vol. 22, no. 4, pp. 39–51, Oct. 2001. [Online]. Available: http://dl.acm.org/citation.cfm?id=567363.567366
[40] M. Lintean, C. Moldovan, V. Rus, and D. McNamara, "The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis," in Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23, Oct. 2010, pp. 235–240.
[41] N. Niraula, R. Banjade, D. Ştefănescu, and V. Rus, "Experiments with semantic similarity measures based on LDA and LSA," in Statistical Language and Speech Processing, A.-H. Dediu, C. Martín-Vide, R. Mitkov, and B. Truthe, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 188–199.
[42] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: applications to image and text data," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 245–250.
[43] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Sharing clusters among related groups: Hierarchical Dirichlet processes," in Advances in Neural Information Processing Systems, 2005, pp. 1385–1392.
