Sobek: A Text Mining Tool To Support Reading Comprehension

Daniel Epstein, Marcio Bigolin, Eliseo Reategui

Graduate Program of Computers in Education

Federal University of Rio Grande do Sul (UFRGS)

Porto Alegre – Brazil

daepstein@gmail.com; marcio.bigolin@gmail.com, eliseoreategui@gmail.com

Abstract

This paper presents a mobile version of a text mining tool created mainly for
educational applications. The tool, called Sobek, is capable of extracting relevant
concepts from a text as well as providing a concise view of the relationships between
those concepts. The article presents results of an experiment carried out to validate the
tool. The experiment, based on Cohen's Kappa Coefficient and on the use of a gold
standard list provided by a group of 6 experts, showed that the tool was able to extract
terms and relationships from a given text that were accepted as highly accurate by the
experts.

Keywords: text mining, reading, literacy, graphic representation.

Introduction

Text mining is a research field that encompasses different approaches such as
information retrieval, natural language processing, information extraction, text
summarization, supervised and unsupervised learning, probabilistic methods, text
streams and social media mining, opinion mining and sentiment analysis [1]. It aims to
retrieve relevant information from semi-structured or unstructured data (Feldman and
Sanger 2006). It is a field in constant development, mostly due to the ever-increasing
amount of information that can be found on the Internet. In this research, we have
focused specifically on the development of a mobile version of a text mining tool that
has been used mostly in educational applications in the past. The tool, called Sobek, is
capable of retrieving important information from texts and representing it graphically.
Having been developed for educational uses, Sobek has always had the requirement that
it should be simple to use, powerful and versatile. We show in this paper how Sobek has
been adapted for mobile devices, with a focus on educational applications. We also
present validation results that demonstrate the accuracy of the mining tool when
compared with human experts.

Text Mining and Education

From books to personal blogs and emails, texts often need to be classified and
examined. The process is usually statistical, sometimes relying on semantic and
syntactic information that highlights important or recurrent concepts from a text. Some
examples of text mining use include patent analysis (Tseng et al. 2007), e-commerce
(Pang and Lee 2008), bioinformatics (Krallinger and Valencia 2005), predicting stock
price changes (Mittermayer 2004; Wuthrich et al. 1998; Geva and Zahavi 2010), and
comparing and assessing patient conditions in medical centers (Damasceno 2014).

In the field of Education, text mining has become more popular especially with the advent of
distance learning and Massive Open Online Courses (Shatnawi et al., 2014). It has been used,
for example, in the analysis of student online interaction, showing that the combination of text
and data mining, applied to a large data set, can reveal relevant information about students'
behavior (He, 2013). Text mining has also been used to conduct formative assessment and let
learners and instructors visualize results, providing an alternative way to evaluate
learners’ performance throughout the learning process (Hsu et al., 2011). Researchers have
also evaluated how the mining of student responses from a survey could yield relevant
information for management purposes (Yu et al., 2011), and contrasted the results obtained
through the mining of students' opinions about teacher leadership with those of human raters
(Xu & Reynolds, 2012).

Other educational uses of text mining include the evaluation of students' posts in
discussion forums (Azevedo et al., 2014), and the analysis of collaborative writing
assignments to support teachers' work (Macedo et al., 2009). However, none of the works
listed above focused on the use of text mining techniques to support learning processes.
Pinho et al. (2013) have done so, proposing the use of text mining to monitor and provide
guidance in students' online conversations in the context of foreign language learning.
Text mining has also been proposed to support summarization tasks (Reategui et al., 2012)
and fanfiction writing (Campelo and Reategui, 2012). Here, our focus has been different in
that we were interested in understanding how the use of such a tool could support reading
comprehension and help students understand concepts in the field of Science. Our study
has been based on the fact that important information is often overlooked by
non-proficient readers (Winograd, 1983). Text mining may provide relevant cues about
topics in a reading material that are likely to be important.

However, being able to retrieve information from texts is only useful if the information
retrieved is properly presented. Ambiguous or unclear representation may confuse the
users, hindering their understanding of a text, instead of clarifying it. Research results
have shown that graphs are a suitable way to represent information extracted from texts
because of their simple organization and interpretation, in which nodes represent
concepts/ideas and connections represent relationships between them (Chein and
Mugnier 2008). Another important aspect lies in the fact that connections between
nodes may represent different information, helping users to better understand the
visualization of the data extracted.

From an educational perspective, there are several ways to represent information
visually and it has been argued that non-linguistic representations may provide
additional help to students in reading and writing tasks (Marzano et al. 2001). David
Hyerle (2009) tried to demonstrate how different types of visual tools, called graphic
organizers, could help students and teachers represent information and communicate
with others. These graphical representations have been applied across a large range of
subject areas, demonstrating their benefits in different activities such as mapping cause
and effect, note taking, comparing and contrasting concepts, organizing problems and
solutions, and relating information to main ideas or themes (Hall 2002). In the work
presented in this article, we propose the use of a particular text mining tool to extract and
visually represent relevant terms from texts, helping students in reading comprehension
tasks.

Sobek Text Miner

Sobek is a text miner that has been developed as part of this project. It is capable of
analyzing a text and providing a graphical representation of its most relevant terms and
relationships. The algorithm used in the development of Sobek was originally
designed by Schenker (2003) and later modified by Reategui and Epstein (2012) to adapt it
to educational purposes. The analysis of plain text is Sobek's simplest operation. The
text to be analyzed can be copied and pasted in the tool's text editor or it can be loaded
from a file in TEX, PDF or DOC formats.

Sobek's operation can be divided into three stages. The first is the identification and
summarization of relevant concepts in the text. The second is the
identification of relationships among those concepts, and the last one concerns the
visual representation of the extracted information in the form of a graph. A more detailed
description is provided in the following subsections.

Identifying key terms in a text

In the first step of Sobek's mining algorithm, a text T is split into a set of words W, using
spaces and punctuation marks as dividers. The set of words W is then mapped into terms
that may consist of a single word (called here a “single term”) or many words (called a
“compound term”). This mapping is a statistical process that considers the frequency τ
with which each word is found in the text. When a sequence of consecutive words in W
(e.g. wj, wj+1 and wj+2) is repeated with a certain frequency, a compound term is formed (e.g.
"Global Warming"). Once those compound terms are found, the combinations of words
that created them are removed from the word list and the remaining words are considered
single terms. A word may appear in both a compound term and a single term, as long as
Sobek identifies its appearance in the text individually as well as in the compound term.

In order to identify whether a word should be part of a compound term or should
figure in the list of single terms, each word wi ∈ W is combined with its n subsequent
words to create a set S of terms:

$$S = \{w_i,\; w_i \cup w_{i+1},\; w_i \cup w_{i+1} \cup w_{i+2},\; \ldots,\; w_i \cup w_{i+1} \cup \cdots \cup w_{i+n}\}$$

For instance, the sequence of words 'AA BB CC' in a scenario where n=3 may be used
to create the following set of strings: S = {'AA'; 'AA BB'; 'AA BB CC'; 'BB'; 'BB CC';
'CC'}. Although the value of n could be higher than 3, terms with more than 3 words are
not very frequent, and the benefit of identifying them would not justify the computation
required.
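
The generation of candidate terms described above can be illustrated with a short Python sketch. It is only an approximation written for this explanation, not Sobek's actual code: the function name candidate_terms and the parameter max_words (the maximum number of words in a term) are hypothetical, and the sketch assumes that candidate runs may cross sentence boundaries, which the description above does not specify.

```python
import re
from collections import Counter

def candidate_terms(text, max_words=3):
    """Count every run of 1..max_words consecutive words (hypothetical helper)."""
    # Split on anything that is not a word character (spaces, punctuation).
    words = re.findall(r"\w+", text)
    counts = Counter()
    for i in range(len(words)):
        for size in range(1, max_words + 1):
            if i + size <= len(words):
                counts[" ".join(words[i:i + size])] += 1
    return counts

print(sorted(candidate_terms("AA BB CC")))
# ['AA', 'AA BB', 'AA BB CC', 'BB', 'BB CC', 'CC']
```

The returned counter holds the frequency of every candidate, so a repeated sequence such as "global warming" can later be compared against the frequency threshold.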

Once the process of term identification is finished, the elements in S whose frequency
is higher than a minimum value ϕ are selected for further consideration. The threshold
ϕ is determined so that the resulting set of terms Rc has a
minimum size ε. The minimum value accepted for ϕ is 2; otherwise all the words of any
text would feature in the resulting graph.

$$\phi = \begin{cases} 2, & |R_c| < \varepsilon \\ > 2, & |R_c| \geq \varepsilon \end{cases}$$

During the process of identification of terms, three functions are used to remove terms
and words that do not add information to the graph. The first one is the removal of stop
words. These words are mainly articles and prepositions that do not have any specific
meaning. The second function is called stemming and it is used to reduce redundancy
and remove terms with the same meaning and/or similar spelling. The third function is
related to the identification of synonyms by using a thesaurus, which enables Sobek to
further prune terms that have a similar meaning from the resulting graph.
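
A rough idea of the first two functions can be given with the Python sketch below. It is a simplification made under explicit assumptions: the stop word list is a tiny hypothetical sample, and naive_stem is a crude suffix stripper standing in for a real stemming algorithm, which the text does not name. The thesaurus lookup is not covered.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list (articles and prepositions).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and"}

def naive_stem(word):
    """Strip a few common English suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def merge_variants(counts):
    """Drop stop words and add up the frequencies of words sharing a stem."""
    merged = Counter()
    for word, frequency in counts.items():
        word = word.lower()
        if word not in STOP_WORDS:
            merged[naive_stem(word)] += frequency
    return merged

raw = Counter({"students": 3, "student": 2, "the": 10, "schools": 1, "school": 4})
print(merge_variants(raw))
```

In this example "students" and "student" are merged into a single entry, as are "schools" and "school", while "the" is discarded.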

After the identification of terms that have a frequency τ greater than ϕ, all other words
are ignored. Although the number of terms returned from the mining (ε) can be
determined by the user, according to Novak and Cañas (2006), no more than 25 terms
should be necessary to identify the central idea of a text. Based on this assumption,
Sobek's default setting is ε = 20. There are several ways to change the value of ε,
including selecting arbitrary values for ϕ or selecting different graph sizes, in which
case Sobek will automatically adjust the value of ε .
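
One possible reading of the threshold selection is sketched below in Python. The sketch assumes that a term is kept when its frequency is at least ϕ, and that ϕ is raised from its minimum of 2 for as long as at least ε terms would remain; both points are our interpretation of the description above, and the names select_terms, epsilon and min_phi are hypothetical.

```python
from collections import Counter

def select_terms(counts, epsilon=20, min_phi=2):
    """Return (phi, kept_terms) for a frequency table of candidate terms."""
    if epsilon < 1:
        raise ValueError("epsilon must be at least 1")
    phi = min_phi
    # Raise phi while the stricter threshold would still keep epsilon terms.
    while sum(1 for f in counts.values() if f >= phi + 1) >= epsilon:
        phi += 1
    kept = {term: f for term, f in counts.items() if f >= phi}
    return phi, kept

counts = Counter({"warming": 5, "climate": 4, "ice": 3, "sea": 2, "polar": 1})
print(select_terms(counts, epsilon=2))   # (4, {'warming': 5, 'climate': 4})
print(select_terms(counts, epsilon=10))  # phi stays at 2; 'polar' is dropped
```

When the text is too short to yield ε terms, the threshold stays at its minimum and every term occurring at least twice is returned.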

Identifying relationships between concepts

Sobek's second step is to identify relationships between terms. Each term c selected by
Sobek during the mining process belongs to a set C of all terms. A relationship between
ci and cj implies that there is a connection between them and that they are closely related
in T. It could represent different types of relationships, such as cause and consequence,
membership, time sequence, or others. A new analysis of the text T relates ci and cj when
they are no more than z words apart and when there are no full stops
between them.
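
The proximity rule can be sketched in Python as follows. This is an illustrative approximation rather than Sobek's implementation: it handles single-word terms only, treats every "." as a full stop, and measures distance as the difference between word positions. The function name find_relationships is hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def find_relationships(text, terms, z=5):
    """Count pairs of terms found at most z words apart within one sentence."""
    terms = {t.lower() for t in terms}
    links = Counter()
    # Splitting on full stops guarantees that linked terms share a sentence.
    for sentence in text.split("."):
        words = re.findall(r"\w+", sentence.lower())
        positions = [(i, w) for i, w in enumerate(words) if w in terms]
        for (i, a), (j, b) in combinations(positions, 2):
            if a != b and j - i <= z:
                links[tuple(sorted((a, b)))] += 1
    return links

text = "Global warming raises sea levels. Emissions drive warming."
print(find_relationships(text, {"warming", "sea", "emissions"}))
```

Here "sea" and "warming" are linked because they appear two words apart in the first sentence, whereas "sea" and "emissions" are not linked because a full stop separates them.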

Depending on the size of set C, a term ci could be related to too many other terms,
which would produce a graph in which the connections would have no particular
meaning. To reduce the number of connections between terms, a maximum of r links is
allowed for any term. However, terms with high frequency do not have the same
number of connections as terms with low frequency. Each term may have at most ω
connections and this value is proportional to the frequency τ of that term. In this way,
more frequent terms have a larger value ω and may be connected to a larger number of
terms in the graph. Only the term with the highest τ will have ω = r; the other terms will
have at most r times their frequency divided by the frequency of the most frequent term.

$$\omega_{c_i} = \frac{\tau_{c_i} \times r}{\max\{\tau_c : c \in C\}}$$

In cases when a term could have more relationships than it is allowed to have, the
relationships selected to be displayed are those that occur more frequently. There is no
lower bound to the number of times a relationship between two terms should occur in
the text for it to be considered as a link in the graph. Sobek uses z = 5 and r = 7; those
specific parameters were defined based on the users' review of the tool. A bigger value
for z would link terms that are not always related and a bigger value for r would
produce a larger number of connections, which could make their interpretation more
difficult.
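
The link limit and the pruning of relationships could be sketched in Python as shown below. Two details are assumptions of ours, since the text does not specify them: ω is rounded up so that every term may keep at least one link, and links are accepted greedily, from the most to the least frequent, while both terms still have capacity. The names max_links and prune_links are hypothetical.

```python
import math

def max_links(freq, r=7):
    """Compute omega for each term: its frequency times r over the top frequency."""
    top = max(freq.values())
    return {term: math.ceil(f * r / top) for term, f in freq.items()}

def prune_links(links, freq, r=7):
    """Keep the most frequent links without exceeding any term's omega."""
    capacity = max_links(freq, r)
    kept = []
    # Most frequent relationships first; ties broken alphabetically.
    for (a, b), n in sorted(links.items(), key=lambda kv: (-kv[1], kv[0])):
        if capacity[a] > 0 and capacity[b] > 0:
            kept.append((a, b, n))
            capacity[a] -= 1
            capacity[b] -= 1
    return kept

freq = {"warming": 14, "climate": 7, "ice": 2}
links = {("ice", "warming"): 3, ("climate", "ice"): 1, ("climate", "warming"): 5}
print(max_links(freq))           # {'warming': 7, 'climate': 4, 'ice': 1}
print(prune_links(links, freq))  # [('climate', 'warming', 5), ('ice', 'warming', 3)]
```

In this example the low-frequency term "ice" is allowed a single link, so only its most frequent relationship is kept.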

Sobek's Graphical Visualization

Sobek's final step is the creation of a graphical representation of the terms
extracted from the text. In this representation, the terms are represented by nodes and
the relationships between them by links between nodes. To enhance visualization, each
node has a different size based on its frequency. The larger the node, the higher its
relative frequency is when compared to that of other nodes. Figure 1 shows a graph
obtained from a Wikipedia text about global warming.

Figure 1: Graph extracted from Wikipedia text on Global Warming¹

The graph produced has a set of functionalities that allow the user to personalize it. It is
possible to add and remove nodes and connections, as well as to interact with the graph
by changing a node's position. When a given node is clicked, its frequency is
displayed below the graph, as can be seen in Figure 1. Fragments from the text in which
the term appears are also displayed below the graph. The Prefuse API² has been used to
display the graph in Sobek.

Evaluating Sobek's capacity to extract relevant concepts from texts

To evaluate the quality of the results returned by Sobek, two metrics have been defined:
the gold standard and Cohen's Kappa Coefficient (Cohen 1960). These metrics provide
information related to the accuracy of the methods used to extract relevant terms from
texts when comparing the results with the opinions of experts. The goal has been to
evaluate whether Sobek is capable of extracting terms from texts that are close to the
ones experts in the field of research would provide.

¹ https://en.wikipedia.org/wiki/Global_warming
² http://www.prefuse.org/doc/api/

For the gold standard test, two different articles were selected from journals with a
focus on education and technology. The articles were presented to six experts in the
field, namely researchers and PhD candidates. Each participant had to choose the article
closer to his/her expertise, and then write a list of terms he/she thought best described
the text's main ideas. Half of the participants selected article A, and the other half
selected article B. The gold standard list for each text was created by considering terms
that appeared in at least 2 of the 3 experts' lists. Table 1 shows the gold standard list
and the list of terms extracted by Sobek for each of the articles.

Table 1: Gold standard list and list of terms extracted by Sobek

Text A                                 Text B
Gold Standard List    Sobek            Gold Standard List         Sobek
classroom             classroom        bus                        bus
communication         communication    internet                   internet
facebook              Facebook         learning                   learning
learning              learning         network                    network
parents               parent           school                     schools
school                school           wi-fi                      wi-fi
social media          social media     wireless                   wireless
students              student          opportunities              mobile
-                     benefits         -                          students
-                     Sherry           technology in education    technology

The gold standard test enables the computation of a sensitivity measurement (also called
true positive rate or recall), in which the number of gold standard terms correctly
identified by Sobek (nS) is divided by the number of terms listed by the experts (nE).
For article A, the sensitivity measurement indicating
Sobek’s ability to correctly identify relevant terms was:

sensitivity = nS / nE = 8 / 8 = 1.0

The precision measurement was computed by dividing the number of correct terms
identified as relevant by Sobek (nS) by the total number of terms (tS):

precision = nS / tS = 8 / 10 = 0.8

Article B was read by 3 different experts, and Sobek correctly identified 8 terms present
in the gold standard list. The sensitivity and precision computed for this text were:

sensitivity = 8 / 9 ≈ 0.88

precision = 8 / 10 = 0.8

Considering the precision and the sensitivity measurements, an F-score may be
computed, indicating the overall accuracy of the method, with values ranging from 0 to
1.

$$F_1 = \frac{2 \times \text{precision} \times \text{sensitivity}}{\text{precision} + \text{sensitivity}}$$

According to this definition, the F1-scores computed for articles A (F1A) and B (F1B)
were:

$$F_{1A} = \frac{2 \times 0.8 \times 1.0}{0.8 + 1.0} = 0.89 \qquad F_{1B} = \frac{2 \times 0.88 \times 0.8}{0.88 + 0.8} = 0.84$$
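
The three measurements can be reproduced from the counts reported for each article with a few lines of Python. The function name scores and its parameter names are hypothetical and were chosen only for this illustration.

```python
def scores(n_correct, n_gold, n_extracted):
    """Return (sensitivity, precision, F1) from the term counts."""
    sensitivity = n_correct / n_gold        # nS / nE
    precision = n_correct / n_extracted     # nS / tS
    if precision + sensitivity == 0:
        return sensitivity, precision, 0.0
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, precision, f1

# Article A: 8 correct terms, 8 gold standard terms, 10 terms extracted.
print([round(v, 2) for v in scores(8, 8, 10)])  # [1.0, 0.8, 0.89]
# Article B: 8 correct terms, 9 gold standard terms, 10 terms extracted.
print([round(v, 2) for v in scores(8, 9, 10)])  # [0.89, 0.8, 0.84]
```

With standard rounding, the sensitivity for article B is printed as 0.89 rather than the truncated value 0.88; the resulting F1-score is 0.84 in both cases.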

In order to evaluate the inter-rater agreement between the gold standard list and
Sobek’s list of terms, Cohen’s Kappa coefficient has been used. Cohen’s Kappa
coefficient is a measurement of agreement between judges for qualitative items that
corrects for the agreement that may occur by chance (Carletta 1996). For N
items, it measures the agreement between judges when classifying them into C mutually
exclusive classes. A value of 1 means complete agreement, while a value of 0 means that
the agreement is no better than what would be expected by chance.
The value of the Kappa coefficient is given by:

$$K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)},$$

where Pr(a) is the relative observed agreement among raters and Pr(e) is the
hypothetical probability of chance agreement.

The Kappa coefficient has been used to compute the agreement between Sobek and the gold
standard list regarding the classification of concepts. Each term of the text was
classified as relevant or irrelevant. As each term could be placed in only one list
(relevant or irrelevant), all concepts not listed by Sobek or the gold standard were
automatically placed in the irrelevant list. This evaluation did not consider compound
terms. The reasons for this were twofold: first, the number of possible terms in each text
would increase drastically; second, considering only single terms avoided the bias of
choosing a maximum size for compound terms (although these usually do not include more
than 2 or 3 words).

The first text (article A) had a total of 453 single terms. Sobek’s concept list had 10
terms considered relevant, therefore classifying 443 terms as irrelevant. The gold
standard list contained 8 relevant terms, leaving 445 as irrelevant. The two lists
shared 7 terms.

The observed agreement between lists totaled 451 terms, while the expected agreement
by chance was 435 terms. The value of the Kappa coefficient for article A was 0.79,
which is considered a substantial agreement. This high value for K implies that the
similarity between Sobek's concept list and the gold standard list was not a result of chance.

The same principles applied to article B, which had 321 single terms. Sobek considered
10 of those terms relevant and the gold standard list considered 9, with 8 terms in
common. In this scenario, the observed agreement totaled 318 terms, and the
expected agreement by chance was 302 terms. As a result, the value of K was 0.837,
which is considered an almost perfect agreement. Such results demonstrate a very good
performance regarding the tool's capacity to extract relevant terms from texts.
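
As a worked check, the Kappa value for article B can be recomputed from the four counts given above. The Python sketch below follows the two-class (relevant or irrelevant) setting used in this evaluation; the function name cohen_kappa and its parameter names are hypothetical.

```python
def cohen_kappa(n_items, n_relevant_a, n_relevant_b, n_both_relevant):
    """Cohen's Kappa for two raters sorting items into relevant/irrelevant."""
    # Items that neither rater marked as relevant.
    both_irrelevant = n_items - (n_relevant_a + n_relevant_b - n_both_relevant)
    p_observed = (n_both_relevant + both_irrelevant) / n_items
    # Chance agreement: both say "relevant" or both say "irrelevant".
    p_expected = (
        n_relevant_a * n_relevant_b
        + (n_items - n_relevant_a) * (n_items - n_relevant_b)
    ) / n_items ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Article B: 321 single terms, 10 relevant for Sobek, 9 for the gold standard, 8 shared.
print(round(cohen_kappa(321, 10, 9, 8), 3))  # 0.837
```

For article B the observed agreement is 318 of 321 terms and the expected agreement by chance is approximately 302.6 terms, which yields the reported value of 0.837.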

Conclusion

This article presented an evaluation of a text mining tool that is capable of extracting
relevant terms and relationships from a text, presenting this information in the form of a
graph. We described a few experiments carried out in the field of education showing
that the text mining tool has a good level of accuracy when extracting terms from texts,
and that the visual representations provided may help students in text understanding and
concept building in the area of Science. The experiments using the gold standard tests
showed that Sobek's concept lists extracted from a couple of texts were very close to the
ones provided by experts. The values of both sensitivity and precision were high in both
experiments. The Kappa coefficient values found in the experiments strongly indicate
that Sobek is capable of extracting relevant terms and that those terms are not selected by
chance. They also indicate a strong agreement between Sobek's term list and that of the
gold standard, once again reinforcing the accuracy of the text mining method. Such
results indicate a good potential for the use of Sobek in educational applications.

Future work includes the study of how to improve the representation of terms by
connecting the graph's nodes to specific ontologies. Sobek has also been released as a
Chrome extension and it will be incorporated in a virtual learning environment, which
may promote its use by a larger number of students and teachers. The observation of
how students and teachers use the tool should also give us further insight about possible
applications of Sobek in novel research initiatives.

References

Azevedo B. F., Reategui E., Behar P. A. (2014). Analysis of the relevance of posts in
asynchronous discussions. Interdisciplinary Journal of Knowledge and Learning
Objects, 10:107–121.

Calvo, R. A., O’Rourke, S. T., Jones, J., Yacef, K., and Reimann, P. (2011).
Collaborative writing support tools on the cloud. IEEE Transactions on Learning
Technologies, 4(1):88–97.

Chein, M. and Mugnier, M.-L. (2008). Graph-based Knowledge Representation:
Computational Foundations of Conceptual Graphs. Springer Publishing Company,
Incorporated, 1 edition.

Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and
Psychological Measurement, 20(1):37.

Costa, P. S. C., Reategui, E. (2011). Foreign Language Literacy through Fanfiction
Writing and Text Mining. Literacy Information and Computer Education Journal
(LICEJ), v. 2, p. 456-461.

Dirks E., Spyer G., van Lieshoult E. C. D.M., de Sonneville, L (2008). Prevalence of
combined reading and arithmetic disabilities. Journal of Learning Disabilities, pages
460–473.

Feldman, R. and Sanger, J. (2006). Text Mining Handbook: Advanced Approaches in
Analyzing Unstructured Data. Cambridge University Press, New York, NY, USA.

Geva, T. and Zahavi, J. (2010). Predicting intraday stock returns by integrating market
data and financial news reports. In MCIS, page 39. AISeL.

Guzzetti, B. J.; Gynder, T. E.; Glass, G. V.; Gamas, W. S. (1993). Promoting conceptual
change in science: A comparative meta-analysis of instructional interventions from
reading education and science education. Reading Research Quarterly, p. 117-159.

He, W. (2013). Examining Students' Online Interaction in a Live Video Streaming
Environment Using Data Mining and Text Mining. Computers in Human Behavior, 29
(1): 90-102. DOI: https://doi.org/10.1016/j.chb.2012.07.020

Hsu, J.-L., Chou, H.-W., Chang, H.-H. (2011) EduMiner: Using text mining for
automatic formative assessment. Expert Systems with Applications, 38(4): 3431-3439.

Jenner, J. (2003). A bridge to reading and writing literacy: Developing oral language
skills in young children.

Korhonen, J., Linnanmäki, K., and Aunio, P. (2014). Learning difficulties, academic
wellbeing and educational dropout: A person-centred approach. Learning and
Individual Differences, 31:1–10.

Krallinger, M. and Valencia, A. (2005). Text-mining and information-retrieval services
for molecular biology. Genome Biology, 6(7):224+.

Macedo, A. L., Reategui, E. B., Lorenzatti, A., and Behar, P. A. (2009). Using
textmining to support the evaluation of texts produced collaboratively. In WCCE,
volume 302 of IFIP Advances in Information and Communication Technology, pages
368–377. Springer.

Marzano, R., Pickering, D., and Pollock, J. (2001). Classroom Instruction that Works:
Research-based Strategies for Increasing Student Achievement. Gale virtual reference
library. Association for Supervision and Curriculum Development.

Mittermayer, M.-A. (2003). Forecasting intraday stock price trends with text mining
techniques. Hawaii International Conference On System Sciences, 03.

Nandhini, K. and Balasundaram, S. (2013). Improving readability through extractive
summarization for learners with reading difficulties. Egyptian Informatics Journal,
14(3):195–204.

Novak, J. D. and Cañas, A. J. (2006). The origins of the concept mapping tool and the
continuing evolution of the tool. Information Visualization, 5(3):175–184.

Özmen, H. (2011). Effect of animation enhanced conceptual change texts on 6th grade
students’ understanding of the particulate nature of matter and transformation during
phase changes. Computers & Education, v. 57, n. 1, p. 1114-1126.

Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Trends Inf. Retr.,
2(1-2):1–135.

Pinho, I. da C., Epstein, D., Reategui, E., Correa, Y. (2013). The use of text mining to
build a pedagogical agent capable of mediating synchronous online discussions in the
context of foreign language learning. In Proceedings of International Conference of
Frontiers in Education (pp. 393-399). New York: IEEE Press.

Shatnawi, S., Gaber, M. M., Cocea, M. (2014). Text stream mining for Massive Open
Online Courses: review and perspectives. Systems Science & Control Engineering,
2(1): 664-676.

Tseng, Y.-H., Lin, C.-J., and Lin, Y.-I. (2007). Text mining techniques for patent
analysis. Inf. Process. Manage., 43(5):1216–1247.

Villalon, J., Kearney, P., Calvo, R., and Reimann, P. (2008). Glosser: Enhanced
feedback for student writing tasks. In Advanced Learning Technologies, 2008. ICALT
’08. Eighth IEEE International Conference on, pages 454–458.

Warschauer, M. (2006). Laptops and Literacy: Learning in the Wireless Classroom.
New York: Teachers College Press.

Wei, C.-W., Hsieh, Z.-H., Chen, N.-S., and Kinshuk (2012). Construction of reading
guidance mechanism on e-book reader applications for improving learners’ English
comprehension capabilities. In ICALT, pages 170–172. IEEE.

Winograd, P. N. (1983). Strategic difficulties in summarizing texts. University of
Illinois at Urbana-Champaign; Cambridge, Mass.

Wuthrich, B., Cho, V., Leung, S., Permunetilleke, D., Sankaran, K., Zhang, J., and Lam,
W. (1998). Daily stock market forecast from textual web data. In IEEE International
Conference on Systems, Man, and Cybernetics, pages 2720–2725.

Xu, Y., Reynolds, N. (2012). Using Text Mining Techniques to Analyze Students' Written
Responses to a Teacher Leadership Dilemma. International Journal of Computer Theory
and Engineering, 4(4).

Yu, C. H., DiGangi, S. A., & Jannasch-Pennell, A. (2011). Using Text Mining for
Improving Student Experience Management in Higher Education. In P. Tripathi, & S.
Mukerji (Eds.) Cases on Innovations in Educational Marketing: Transnational and
Technological Strategies (pp. 196-213). Hershey, PA: Information Science Reference.
doi:10.4018/978-1-60960-599-5.ch012
