Comparison of Learning Analytics and Educational Data Mining


Computers and Education: Artificial Intelligence 2 (2021) 100016

Contents lists available at ScienceDirect

Computers and Education: Artificial Intelligence


journal homepage: www.elsevier.com/locate/caeai

Comparison of learning analytics and educational data mining: A topic modeling approach
David J. Lemay a, Clare Baek b,*, Tenzin Doleck c
a McGill University, USA
b University of Southern California, USA
c Simon Fraser University, USA

ARTICLE INFO

Keywords:
Educational data mining
Learning analytics
Topic modeling
Structural topic modeling (stm)

ABSTRACT

Educational data mining and learning analytics, although experiencing an upsurge in exploration and use, continue to elude precise definition; the two terms are often interchangeably used. This could be owing to the fact that the two fields exhibit common thematic elements. One avenue to provide clarity, uniformity, and consistency around the two fields, is to identify similarities and differences in topics between the two evolving fields. This paper conducted a topic modeling analysis of articles related to educational data mining and learning analytics to reveal thematic features of the two fields. Specifically, we employed structural topic modeling to identify the topics of the two fields from the abstracts. We apply structural topic modeling on N = 192 articles for educational data mining and N = 489 articles for learning analytics. We infer five-topic models for both educational data mining and learning analytics. We find that while there appears to be disciplinary differences in terms of research focus, there is little support for a clear distinction between the two disciplines, beyond their different lineage. The trend points to a convergence within the field of educational research on the applications of advanced statistical learning techniques to extract actionable insights from large data streams for optimizing teaching and learning. Both fields have converged on an increasing focus on student behaviors over the last five years.

1. Introduction

Artificial intelligence (AI) research in the field of education is still in its early development. AI in education is best represented by two data-centric fields that have arisen in the last few decades (Baek & Doleck, 2020), educational data mining (EDM) and learning analytics (LA), that exploit machine learning in educational research. Many reviews to date have attempted to describe these two complementary fields, which evolved as a result of the democratization of access to computing power and the open source and open data movements in educational research (Aldowah et al., 2019; Baker and Inventado, 2014; Calvet Liñán and Juan Pérez, 2015; Ihantola et al., 2015; Papamitsiou and Economides, 2014; Romero and Ventura, 2020).

“Data mining based on educational data is named as ‘educational data mining’ and the use of the patterns based on educational/instructional data is called learning analytics” (Sahin & Yurdugul, 2019, p. 122). Data mining goes back to the initial development of electronic record keeping and the advent of object-relational databases and the exploitation of structured records through pattern analysis for knowledge discovery. As information has become more plentiful, new approaches to leveraging this data into actionable insights have transformed practices. Learning analytics, like management analytics and business intelligence before it, is predicated on leveraging educational data for informing its own ends, that is, teaching and learning.

The empirical research tends to treat LA and EDM if not as coextensive, at least as two kindred endeavors dedicated to the application of advanced statistical learning methods to the analysis of learning and educational data respectively. While many treat the two fields as generally interchangeable (Romero and Ventura, 2020), some adherents defend a clear boundary between the two fields. The primary argument relies on an exclusionary definition of their respective fields of inquiry. In essence, the difference hinges on the recognition that learning is not education and vice versa, regardless of how much overlap there might be. Thus, LA is focused on the processes influencing learning, at the individual and social levels, whereas EDM is focused on knowledge discovery from all educational data sources produced by individuals and groups of individuals supported by institutional frameworks. In practice, it may be harder to discern a clear boundary since research amply demonstrates

* Corresponding author.
E-mail addresses: david.lemay@mail.mcgill.ca (D.J. Lemay), clarebae@usc.edu (C. Baek), tdoleck@sfu.ca (T. Doleck).

https://doi.org/10.1016/j.caeai.2021.100016
Received 17 December 2020; Received in revised form 9 March 2021; Accepted 14 March 2021
2666-920X/© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

that institutional factors influence individual learning behaviors (Baeten et al., 2010; Prosser and Sze, 2014; Vermunt and Donche, 2017). Indeed, the empirical literature also shows a great deal of similarity in the range and diversity of research topics and analytical methods.

Papamitsiou and Economides (2014) summarize the salient difference between LA and EDM forming consensus in the field as: “LA adopts a holistic framework, seeking to understand systems in their full complexity. On the other hand, EDM adopts a reductionistic viewpoint by analyzing individual components, seeking for new patterns in data and modifying respective algorithms” (p. 50). However, Aldowah et al. (2019) identify four dimensions of research that crosscut the disciplinary differences. Both LA and EDM conducted research along the four following dimensions: computer-supported learning analytics, which includes dropout/retention, student performance, and evaluation; computer-supported predictive analytics, including collaborative learning and self-learning; computer-supported behavioral analytics, focused on modelling student learning; and computer-supported visualization analytics, such as graph or network based methods.

In a theoretically informed study, Xing et al. (2015) explicitly distinguish between the contributions of LA, EDM, and human-computer interaction (HCI). However, the overarching theory is activity theory as formulated by Leontev (1981) and collaborators. All three disciplines have been influenced by constructivist psychology, which is especially evidenced by shared interests in group processes and social interaction. While more disciplinary exchange can help bridge differences (Siemens and Baker, 2012), we wonder to what extent these two disciplines are indeed distinct and whether the differences are more taxonomic than operative in practice. The fertility of exchange between the two disciplines would suggest that EDM and LA are two specimens of the same genus if not the same species.

Romero and Ventura (2020) report that most EDM and LA research is conducted in the same research settings: virtual learning environments, learning management systems, cognitive tutors, and other computer-based learning environments, including mobile devices and

Table 1
Extant reviews on both EDM and LA.

Title: Educational Data Mining and Learning Analytics: An Updated Survey
Author: Romero and Ventura (2020)
Focus: Reviewed milestones, knowledge discovery, educational environments, tools, datasets, methods, objectives, and future trends for both EDM and LA.
Method: Systematic review & frequency analysis
Findings:
- EDM and LA research are often conducted in the same research settings (e.g., virtual learning environment).
- EDM and LA are in the process of moving from the lab to the general market to be used by educational institutions.
- EDM and LA tools are not easy for educators to use.

Title: Educational Data Mining and Learning Analytics: Past, Present, and Future
Author: Şahin and Yurdugül (2019)
Focus: Compared EDM and LA by looking at historical development of each field and future direction of each field.
Method: Temporal comparison
Findings:
- Similarities and differences of EDM and LA are on process and application.
- EDM focuses on automated systems whereas LA focuses on designs where a human is present.
- There has been an ongoing confusion of the two fields.

Title: Differentiating between Educational Data Mining and Learning Analytics: A Bibliometric Approach
Author: Dormezil et al. (2020)
Focus: Compared similarities and differences between EDM and LA.
Method: Bibliometric analysis, keyword analysis using natural language processing
Findings:
- Major research themes of LA focus on student-focused learning objectives.
- Major research themes of EDM focus on the algorithms behind predicting student performance.
- Bibliometric perspective reveals that EDM is a subdomain of LA.

Title: Educational Data Mining and Learning Analytics for 21st century Higher Education: A Review and Synthesis
Author: Aldowah et al. (2019)
Focus: Reviewed EDM and LA in higher education by identifying and comparing EDM and LA techniques.
Method: Systematic review, frequency analysis
Findings:
- EDM and LA are used to provide opportunities and solutions regarding computer-supported analytics.
- EDM and LA applications in higher education can provide benefits in developing a student-focused strategy.

Title: Research in Learning Analytics and Educational Data Mining to Measure Self-Regulated Learning: a Systematic Review
Author: ElSayed et al. (2019)
Focus: Reviewed EDM and LA literature on measuring self-regulated learning in students.
Method: Systematic review, frequency analysis
Findings:
- There is an increasing interest in the use of EDM and LA for assessing self-regulated learning in students.
- Most studies used a combination of self-reported instruments and LA and EDM tools.

Title: Educational Data Mining and Learning Analytics: Differences, Similarities, and Time Evolution
Author: Linan and Perez (2015)
Focus: Compared similarities and differences between EDM and LA by listing goals and methods.
Method: Comparison by topics, trend analysis
Findings:
- Popularity of LA and EDM have grown from 2010 to 2014.
- EDM focuses more on classification and classification whereas LA focuses more on social network analysis.
- However, many similarities exist between both fields: goals, methodologies, techniques.

Title: Application of big data in education data mining and learning analytics – a literature review
Author: Sin and Muthu (2015)
Focus: Reviewed literature on EDM and LA by looking at publication type, year, and authors.
Method: Frequency analysis, trend analysis, summary, thematic analysis
Findings:
- Major trends of EDM articles are about new EDM techniques and analyzing student performance.
- Major trends of Learning Analytics articles are about development of LA design and models, using LA as a tool for assessment, and exploring EDM and LA methods.

Title: Learning Analytics and Educational Data Mining in Practice: A systematic literature review of empirical evidence
Author: Papamitsiou and Economides (2014)
Focus: Reviewed EDM and LA case studies by analyzing research questions, methodology, and findings.
Method: Systematic review
Findings:
- Four distinct major axes of the LA and EDM empirical research include: pedagogy-oriented issues, contextualization of learning, networked learning, educational resources handling.

Title: Learning Analytics and Educational Data Mining: an Overview of Recent Techniques
Author: Steiner et al. (2014)
Focus: Summarized key concepts, objectives, methods, visualizations, and applications of LA and EDM.
Method: Summary overview
Findings:
- LA can be used in educational games and serious games to collect and analyze learners’ data.
- LA has shifted its focus from technical orientation to educational practice.


massive open online courses (MOOCs) and social learning platforms. Researchers collect all kinds of data including “login frequency, number of chat messages between participants and questions submitted to the instructors, response times on answering questions and solving tasks, resources accessed, previous grades, final grades in courses, detailed profiles, preferences from LMSs, forum and discussion posts, affect observations (e.g. bored, frustrated, confused, happy, etc.) and many more” (p. 52). Researchers also employ a range of analytical methods including “classification, followed by clustering, regression (logistic/multiple) and more recently, discovery with models. In addition, algorithmic criteria computed for comparison of methods include precision, accuracy, sensitivity, coherence, fitness measures (e.g. cosine, confidence, lift, etc.), similarity weights, etc” (p. 53). The main areas of research include “student/student behavior modeling and prediction of performance, followed by increase of students’ and teachers’ reflection and awareness and improvement of provided feedback and assessment services” (p. 53). Thus, it is unclear to what extent these two fields are effectively different in practice. Actually assessing the differences between the two disciplines will require a systematic analysis of their research production as ultimately these research communities are defined by their research production.

As shown in Table 1, there have been previous studies that reviewed the status of the EDM and LA literature. These previous reviews examined varieties of components of LA and EDM such as methods, data, tools, past and future trends, and publication type. Most of these reviews used a traditional narrative literature review approach along with frequency analysis of a component of interest.

Topic modeling is a valuable and appropriate tool to uncover meaningful topics from large quantities of textual data and yet is largely unexplored by education related research (Chen et al., 2020a). There have been a few examples of educational technology related literature reviews successfully incorporating a topic modeling approach to analyze abstracts and keywords of articles to uncover research topics and their trend over a time period (e.g., Chen et al., 2020a; Chen et al., 2020b). In comparing EDM and LA specifically, Dormezil et al.'s (2020) review is an example of using a topic modeling technique for analysis. Dormezil et al. (2020) retrieved the keywords from the literature via bibliometric analysis and used natural language processing to examine the categories of EDM and LA keywords separately. The categories of the keywords reveal that EDM focuses on the algorithms of predicting student performance whereas LA focuses on student-focused learning objectives.

To our knowledge, there has not been a literature review that compared both abstracts and keywords from EDM and LA articles and also compared the changing research trend over time. Therefore, we aim to fill this gap with the current literature review. Specifically, we extend the existing literature by using a topic modeling approach to do the following: First, we compare the changing trend of each field over time through keyword analysis over the five-year period from 2015 to 2019. Second, we investigate the general focus of each field holistically by analyzing the abstracts of each article in our literature collection using a topic modeling approach. Third, we investigate the changing focus of each field by analyzing the abstracts of the literature separately by year for the five-year period. In doing so, we compare the overall landscape of the literature as well as the changing trend of EDM and LA by comparing the analysis results using a topic modeling approach.

This study is informed by the communities of practice (CoP) perspective (Wenger, 1998). Specifically, the fields of LA and EDM are conceived as a constellation of practices producing knowledge artifacts following reified and codified activities. These communities are mutually interacting and mutually informing. They are involved in similar pursuits and share the same concerns respecting the object of study and modes of inquiry. Critical self-reflection helps to collectively negotiate the object and modes of the collective research endeavor. This negotiation of meaning is foundational to the processes of mutual engagement, shared enterprise, and shared repertoire which characterize communities of practice. There is an inherently physicalist assumption in early communities of practice research, but the theory has been easily extended to the online environment (Vescio et al., 2008). Suitably enough for our research concern, the CoP framework has been embraced and adapted in the specific context of creating teacher professional learning communities meant to bring teachers (and researchers) together from beyond the school walls.

2. Purpose and research question

The need for comparing EDM and LA is warranted given the ambiguity in the definitions and goals of the two emerging fields (Baek & Doleck, 2021). Hence, the purpose of this study is to understand the similarities and differences between EDM and LA through an approach that has yet to receive research attention: topic modeling of abstract data from articles on EDM and LA to extract underlying topics. The present article attempts to answer the following research question: What are the similarities and differences between educational data mining and learning analytics? By addressing this question, we can contribute to a better understanding of the field and its delineations into sub-specializations, their boundaries and overlaps, and their tensions and contradictions.

3. Method

3.1. Research design

In the present study, we employ an exploratory retrospective design to attempt to describe disciplinary differences between the LA and EDM communities through the analysis of research abstracts as the canonical artifact generated across all disciplinary boundaries.

3.2. Search and screening strategy

Our data collection and analysis were conducted in several stages (Fig. 1). First, we searched for EDM articles in the Web of Science database with the following indexes: SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC. The Web of Science database has been recommended by previous literature reviews as a database with a collection of high-quality papers (Xia and Zhong, 2018; Xie et al., 2019; Fu and Hwang, 2018). We used the search term “Educational Data Mining” (TOPIC: (“Educational Data Mining”) AND DOCUMENT TYPES: (Article)) and filtered the time period to 2015–2019. We followed the exact same process for LA articles, except using the search term “Learning Analytics” (TOPIC: (“Learning Analytics”) AND DOCUMENT TYPES: (Article)). We chose the search terms “Educational Data Mining” and “Learning Analytics” to collect articles for the respective fields since we wish to examine the overall similarities and differences of the two fields. Adding other search terms would have narrowed the search results or brought in articles from other related fields.

The initial search returned 281 articles for EDM and 850 articles for LA. We screened for relevant articles following our inclusion and exclusion criteria (Table 2). Particularly, we only included empirical articles and excluded theoretical or conceptual papers. For example, articles that propose a new learning analytics model without empirical data were excluded. As we stated previously, we aimed to examine how each of the two fields is operating in practice, and empirical papers reveal this actual operation in research.

We followed a two-step process for screening articles. First, we read the abstracts of the articles to exclude articles and to sort out articles that required further investigation, as some abstracts warranted a close reading of the full text to ensure that the studies met our inclusion criteria. Second, for those studies that required further investigation (e.g., whether a study tested a new learning analytics model with empirical data or merely presented a hypothetical model), we conducted a full-text screening to determine whether to include or exclude the articles. The screening process led to the collection of 492 articles for LA and 194


Fig. 1. Study Selection flowchart for Educational Data Mining and Learning Analytics.

articles for EDM. Fig. 2 shows the distribution of the articles in our final collection for LA and EDM by year. Lastly, we created two separate documents containing each field’s abstracts. The articles without abstracts were excluded from the analysis as our primary interest for topic modeling is abstracts. This resulted in the final collection of 489 articles for LA and 192 articles for EDM.

3.3. Abstract analysis process

The overall topic modeling process to analyze the abstracts is graphically represented in Fig. 3. First, we created a corpus of 489 documents (abstracts) of LA and a corpus of 192 documents (abstracts) of EDM. Second, we created a document feature matrix (DFM), which is a table with rows as texts and columns as words. Then, we cleaned the DFM to prepare for topic modeling analysis by removing punctuation, numbers, stopwords, and symbols in R Studio. We also removed the following words before our analysis: results, result, study, can, paper, using, use, used, research. Due to the nature of how abstracts are written, many abstracts included the words listed above as authors report their findings in their abstract. To ensure that our topic modeling analysis includes meaningful words only, we decided to exclude these words. Lastly, we trimmed the DFM to remove features that are rare or ubiquitous to ensure less noise and efficient processing of the DFM.

Next, we used the stm package and quanteda package in R Studio to run the structural topic modeling. Structural topic modeling is an unsupervised method which allows researchers to discover topics in a collection of documents, where each document can be composed of multiple topics (Roberts et al., 2014). Since stm is an unsupervised method, we first ran a set of models by setting topic numbers to 3, 4, 5, 6, 7, 8, 9, 10. After running the analysis for each number of topics, we inspected and compared the resulting models by adapting criteria used by Chen et al. (2020). Specifically, we looked for the following: the terms in each topic


could form a meaningful topic together, topics aligned with important EDM and LA fields, and there was no overlap between topics within one topic model. Based on these criteria, we determined that N = 5 is the best number of topics for both articles in EDM and LA.

Table 2
Inclusion and exclusion criteria for collecting educational data mining and learning analytics literature.

Timelines
  Inclusion: Published in 2015–2019
  Exclusion: Published before 2015
Type of document
  Inclusion: Scholarly journal articles; peer-reviewed
  Exclusion: Not peer-reviewed; conference papers
Availability of access
  Inclusion: Full-text access available
  Exclusion: Full-text access not available
Language
  Inclusion: Written in English
  Exclusion: Written in other languages than English
Relevance
  Inclusion: Empirical papers (including pilot studies, preliminary findings, testing results with samples)
  Exclusion: Theoretical/conceptual papers, literature reviews, policy reviews, not about Learning Analytics, not about Educational Data Mining
  Inclusion: All grade levels (e.g., K-12, higher education, adults)
  Inclusion: Pertaining to learning environments (e.g., elementary school classroom, MOOC for higher education)
  Exclusion: Pertaining to other environments than learning
  Inclusion: The primary purpose is exploring educational context (e.g., training teachers)
  Exclusion: The primary purpose is not exploring educational context (e.g., developing machines not related to learning)

We ran diagnostic tests for the number of topics to make sure that the ideal number of topics is five and that the model with five topics performs better compared to other numbers of topics. We examined several components by running the stm package in R Studio to conduct this diagnostic test: held-out likelihood, residuals, semantic coherence, and lower bound (Figs. 4 and 5). Held-out likelihood evaluates how well the information learned from a corpus applies to unseen documents (Chang et al., 2009). A model that performs better would have a higher held-out probability (Wallach et al., 2009). Residuals can be used to evaluate whether the estimated number of topics produces a good model fit; large residuals, that is, overdispersion of residuals, can indicate that the true number of topics is greater than the number of estimated topics (Taddy, 2012). Semantic coherence can be used to predict topic quality in a model and it corresponds well with the human judgment of topic quality (Mimno et al., 2011; Pandur et al., 2020). Semantic coherence evaluates frequency of co-occurrence of terms with the highest probability in a given topic such that the semantic coherence increases when the co-occurrence of the two terms with the highest probability in a topic increases (Pandur et al., 2020). The lower bound indicates convergence of the model and the model is considered converged when the bound has small enough change between iterations (Roberts et al., 2014).

For both LA and EDM, N = 5 was most ideal for two reasons. First, it had a relatively high held-out likelihood, high semantic coherence, and low residuals collectively compared to other values of N. Second, the growth or fall of the value of each diagnostic measure slowed down at N = 5 (Pandur et al., 2020).

For the final analysis, we chose N = 5 for the number of topics as described above using the stm package and quanteda package in R Studio and set the parameter for each topic to contain 10 terms with the highest topic probability. After generating a list of five topics with ten terms each, we also plotted the stm model using the stm package to visualize the expected proportion of the corpus that belongs to each topic (Figs. 6 and 7). The three words listed with each topic are the top three words associated with the topic (Roberts et al., 2014). Overall, we followed the heuristic model of stm features (Roberts et al., 2014): evaluation phase for determining the number of topics and number of terms to be listed, understanding phase for examining the topics generated, and visualizing phase for visualizing the topics listed through plots.

4. Abstract analysis and results

4.1. Common topics for 2015–2019

To investigate the common topics studied in LA papers and EDM papers from 2015 to 2019, we analyzed the abstracts of the papers using stm topic modeling. For LA papers, we found that the most common topic (Topic 4) was on students’ academic performance and engagement in online courses (Table 3, Fig. 6). The second most common topic was on analytics of student data through pattern findings and different techniques (Topic 1). The third most common topic was providing feedback as a teaching tool through assessment and activities (Topic 2). The topic on designing a model, platform, or a system based on learner information was the fourth most common topic (Topic 5) and social network analysis about students’ collaboration, discussion, and engagement as a group online was the fifth most common topic (Topic 3).

For EDM papers, we found that the most common topic (Topic 1) was predicting students’ academic performance through educational data mining techniques such as algorithms, prediction, and classification

[Bar chart: number of publications per year, 2015 to 2019, for Learning Analytics and Educational Data Mining.]

Fig. 2. Publications of LA and EDM from 2015 to 2019.

(Table 4, Fig. 7). The second most common topic (Topic 3) was about
outcomes of student online learning based on different environments and
activities. The third most common topic was on methods of analyzing
learners’ data (Topic 4). The fourth most common topic was on models
for assessing problems and factors of student data (Topic 5) and the fifth
most common topic was about students in engineering courses (Topic 2).
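
The ranking of topics by prevalence rests on the expected proportion of the corpus assigned to each topic. As a minimal sketch, assuming that this proportion is the mean of the per-document topic proportions (the document-topic matrix, often written theta), the ranking step can be illustrated in Python. The matrix values and the function name `rank_topics` are hypothetical and for illustration only; the study itself used the stm package in R.

```python
from statistics import mean

def rank_topics(theta):
    """Rank topics by expected corpus proportion.

    theta: list of rows, one per document; each row holds that
    document's topic proportions and sums to 1 (hypothetical data).
    Returns (topic_number, expected_proportion) pairs, most common first.
    """
    n_topics = len(theta[0])
    # Expected proportion of a topic = mean of its column in theta.
    expected = [mean(row[k] for row in theta) for k in range(n_topics)]
    # Topics are numbered from 1, as in the tables of the paper.
    return sorted(enumerate(expected, start=1),
                  key=lambda pair: pair[1], reverse=True)

# Toy document-topic matrix: 4 documents, 3 topics (illustrative only).
theta = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.20, 0.50, 0.30],
    [0.10, 0.70, 0.20],
]
ranking = rank_topics(theta)
for topic, share in ranking:
    print(f"Topic {topic}: {share:.3f}")
```

In this toy example the topic numbers carry no meaning beyond their position in the matrix, which mirrors the paper: "Topic 4" being the most common LA topic reflects its expected proportion, not its label.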

4.2. Common topics of each year from 2015 to 2019

We also investigated how research topics of the LA and EDM literature
change over the period of 2015–2019 by running stm topic modeling
separately for the abstracts of each year. Specifically, we conducted a stm
topic modeling analysis for the abstracts of the LA articles published in
each year from 2015 to 2019 and likewise for the EDM articles. As shown
in Appendix A, the topic modeling results reveal the research trend of
each field over the five-year period as well as infer how the two fields
evolve differently over the years.
The top five topics of the LA literature reveal the field’s consistent
focus on student performance and outcomes in a course as these words
constitute the top topics in every year. Some of the years consisted of
topics on social and collaboration aspects of learning. One of the top
topics of 2016 was regarding the social aspect of MOOC and one of the
top topics of 2019 was about social learning online. Interestingly, the top
topic of 2019 was about tools for data analytics although the previous
topics of the LA literature did not consist of topics regarding tools.
The top five topics of the EDM literature of each year reveal its sim-
ilarities and differences to the LA literature clearly. The top topics of the
EDM literature consistently included topics on student performance and
outcomes in a course like the LA literature. Unlike the top topics of LA,
the EDM literature did not focus on social and collaboration aspects of
learning as none of the top topics were related to those throughout the
years. Also, the LA topics included “model” as one of the top topics in
2015 but the years after 2015 did not have topics regarding “model.”

Fig. 3. Graphical representation of topic modeling process. N = number of topics.

Fig. 4. Diagnostic values by number of topics for learning analytics.


Fig. 5. Diagnostic values by number of topics for educational data mining.

Fig. 6. Expected proportion per topic in the corpus and three top words associated with the topic for Learning Analytics.

Fig. 7. Expected proportion per topic in the corpus and three top words associated with the topic for Educational Data Mining.

4.3. Keyword analysis and results

To compare research topics of LA and EDM, we also collected the keywords from each article, which we found below the abstract of the articles. For the LA articles, 53 of the total collection did not contain keywords and for the EDM articles, 10 of the total collection did not contain keywords. These articles were excluded from the analysis. We conducted the analysis for the keywords in two steps. First, we conducted frequency analysis using the tidytext package in R Studio for the LA and EDM articles separately. Second, we chose six of the frequently used terms and conducted frequency analysis of each of these terms over the 5-year period for LA and EDM. For this analysis of terms over time, we used the quanteda package in R Studio. This section examines the frequently used keywords for LA and EDM as well as the trend over time of these terms.

Table 5 lists some of the frequently used keywords for LA and EDM. The list of frequently used words depicts the close connection of LA and EDM. The top keywords of LA include EDM related terms: educational (N = 64), data (N = 116), mining (N = 56). Similarly, the top keywords of EDM include LA related terms: learning (N = 145), analytics (N = 45). The list of top keywords also reflects the differences of the fields: the LA keywords include “social” and “collaborative”, whereas the EDM keywords do not. The top EDM keyword list includes “prediction”, “machine”, and “clustering” but the LA keywords do not contain these

7
D.J. Lemay et al. Computers and Education: Artificial Intelligence 2 (2021) 100016

Table 3 Fig. 8 shows that the trend of the keywords, “systems” and “collab-
Results for learning analytics. orative” become increasingly similar starting in 2017 and they start to
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 overlap starting in 2018. There was a notable peak for the keyword
“online” in 2018. The frequency of “online” is higher than all the other
1 Data Students Social Students Learners
2 Analytics Teachers Analysis Course Design key terms in 2018 and also higher than all the frequency of “online” in
3 Students Feedback Group Student Model other years. Fig. 9 shows how these keywords, that were frequently used
4 Educational Assessment Online Performance System by the LA articles, were used in the EDM articles. The keywords,
5 Education Teaching Students Online Learner “assessment” and “social”, exhibit a similar trend starting in 2017 and
6 Analysis Tool Engagement Academic Based
7 Patterns Analytics Collaborative Courses Open
they both increase after 2018. Unlike the LA articles, “systems” and
8 Different Design Groups Time Information “collaborative” do not exhibit the same trend in the EDM articles. The
9 Techniques Activities Discussion Significant Tools keyword “student” was increasingly used in the keywords of the EDM
10 Findings Student Network Engagement Platform articles starting in 2016 unlike the LA articles where “student” does not
exhibit any noticeable increase.
For the most frequently used keywords in the EDM articles, we
Table 4 examined the frequency of the six keywords: “performance”, “predic-
Results for educational data mining. tion”, “analysis”, “machine”, “classification”, “student.” We also exam-
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 ined the frequency of the same keywords for the LA articles to compare
the difference between LA and EDM. Fig. 10 shows that the keyword
1 Performance Courses Learning Data Model
2 Data Course Students Mining Students “machine” was increasingly used in the EDM articles over the years,
3 Students Students Online Learning Proposed especially starting in 2017. A notable trend is that the keyword “student”
4 Academic Data Student System Knowledge was more frequent until 2018 and the frequency of “machine” overtook
5 Student Educational Different Educational Factors “student” after 2018. Also, the frequency of “prediction” and “analysis”
6 Prediction May Outcomes Learners Problem
7 Algorithms Two Activities Analysis Assessment
trended upward. The keyword “performance” trended downward start-
8 Classification Rules Time Methods Problems ing in 2018. Fig. 11 shows that the keyword “machine” was increasingly
9 Educational Knowledge Teachers Information Models used in the LA literature over the years. Also, the keyword “classification”
10 Mining Engineering Environments Analytics Data exhibits an upward trend starting in 2018 for the LA literature although it
shows a downward trend for the EDM literature starting in 2018.

Table 5 5. Discussion
Top keywords frequency of learning analytics versus educational data mining.
Learning Analytics Frequency Educational Data Mining Words Frequency Our analysis shows that over the recent five years from 2015 to 2019,
Words the difference between the topics of the papers from the two fields is a
Learning 725 Mining 181 matter of degree rather than kind. Both fields were focused on student
Analytics 393 Data 180 performance and learning platforms, and in modelling student behavior.
Data 116 Educational 163 LA papers focused more on student engagement, teaching tools, and so-
Education 92 Learning 145 cial network analysis whereas EDM papers focused more on techniques
Online 72 Analytics 45
Analysis 72 Student 33
and methods of data analysis. This aligns with the differences delineated
Educational 64 Education 33 by the existing literature that LA focuses more on the processes of
Mining 56 Performance 28 learning of individuals and EDM focuses more on knowledge discovery
Student 52 Prediction 27 from other data sources (Papamitsiou and Economides, 2014). As most of
Social 51 Analysis 23
the reviews point out, the fields of inquiry and research settings are
Systems 38 Classification 21
Assessment 38 Clustering 17 increasingly overlapping (Aldowah et al., 2019; Papamitsiou and Econ-
Collaborative 37 Model 16 omides, 2014; Romero and Ventura, 2009). This is evidenced by the
Design 34 Decision 16 overlapping topics. The top keywords for LA are online, social, systems,
collaborative, student; whereas for EDM, they are performance, predic-
tion, analysis, machine, classification, and student. This would tend to
terms.
bear out Sahin and Yurdugul’s (2019) pragmatic distinction, whereby LA
We chose six of the most frequently used keywords for LA and
examined the frequency of these terms overtime. We chose the following
keywords: “online”, “analysis”, “social”, “systems”, “assessment”,
“collaborative.” These words are the most frequently used keywords of
LA along with “learning”, “analytics”, “data”, “education”, “educational”,
“mining”, and “student” and we chose these specific words for a more
meaningful analysis. We also examined the frequency of the same key-
words for the EDM literature to compare the difference between LA and
EDM. Since “learning”, “analytics”, “educational”, “data”, and “mining”
constitute the majority of the proportion of the frequently used key-
words, we decided that it is more suitable to analyze frequency of the
keywords rather than the proportion of the keywords although the
number of keywords may differ per year. Furthermore, due to the nature
of keywords such that a list of keywords does not yield a substantial
quantity of text, frequency analysis is a more suitable technique than
analyzing proportion. As shown in Figures 8, 9, 10, and 11, the frequency
analysis results depict a different trend across the keywords over the
years which confirms that the frequency analysis yields meaningful
Fig. 8. Trend of frequently used LA keywords over the 5-year period in the
results.
LA articles.

8
D.J. Lemay et al. Computers and Education: Artificial Intelligence 2 (2021) 100016

(Leitner et al., 2019). Involving teachers into the feedback and assess-
ment process is crucial in adopting LA and EDM models (Tsai et al.,
2019).
Our temporal analysis revealed a clear trend of moving toward a focus
on the student in both fields. We take this as further evidence that in
terms of objects and modes of inquiry both fields are increasingly over-
lapping. Romeo and Ventura (2020) found that both LA and EDM are
working increasingly toward application of insights into practice. How-
ever, as the many reviews report, most insights derived from EDM and LA
are proving resistant to uptake. Reported barriers range from technical
proficiency to issues of knowledge translation moving from the research
laboratory to the field. Many researchers in both fields are preoccupied
with the application of research findings, which shows that Sahin and
Yurdugul’s (2019) pragmatic distinction may be more illusory than not,
as both EDM and LA would appear to be contending with the perennial
appeal of the applied sciences for usefulness as disciplinary justification.
Fig. 9. Trend of frequently used LA keywords over the 5-year period in the
EDM articles. Some key distinctions between LA and EDM are worth noting; most
notable is LA’s focus on applications to teaching practice, which is
evident in the field’s top topics each year. In every year during
2015–2019, students’ performance or learning in a course (e.g., “stu-
dents, performance, course”) was one of the top topics of the LA papers’
abstracts. The consistent inclusion of “course” as one of the main topics in
the LA literature depicts the field’s focus on applications to teaching and
learning practice in academic courses. On the other hand, the top topics
of the abstracts of the EDM literature were related to “performance” and
“system” each year rather than “course” which suggests the field’s lack of
focus on applications to actual teaching practice compared to the LA
literature. Indeed, discussions on using LA techniques and LA-integrated
applications for teaching practice and inquiry, also called teaching ana-
lytics by some studies, have been a part of the LA literature since the
beginning of the field (e.g., Prieto et al., 2018; Sergis and Sampson, 2017;
Wise and Jung, 2019; Vatrapu et al., 2011). As LA’s aim is not merely on
advancing technical methods, but to make a difference in practice, future
studies should examine the status of the teaching analytics literature
(Wise and Jung, 2019). Doing so could perhaps reveal further differences
Fig. 10. Trend of frequently used EDM keywords over the 5-year period in the
between the LA and EDM literature as well as the practical applications to
EDM articles.
teaching. Generalizability and individualized learning are not among the
core topics for both fields according to our analysis, which suggests a lack
of focus on generalizability and individualized learning for both fields in
their research. Generalizability of LA and EDM models across systems
have been addressed as one of the challenges that the two fields need to
overcome (Baker, 2019; Hutt et al., 2019). Designing and deploying
models that are generalizable and meeting individual student needs are
crucial in optimizing all student learning outcomes, providing equal ac-
cess to education for everyone across digital divides, and providing an
equally applied system for all students (Leitner et al., 2019; Ferguson
et al., 2016).
One of the most common topics of EDM papers is specifically about
students from engineering courses (Table 4), which implies that the
studies are largely focusing on a specific type of sample. Focusing on one
type of learner population is undesirabble as models are developed and
decisions are made upon training data. If the training data contains
largely one type of population, this can be unfairly biased against other
populations with diverse characteristics including race, gender, career
Fig. 11. Trend of frequently used EDM keywords over the 5-year period in the
trajectory, learning needs, and experience (Hutt et al., 2019).
LA articles.
Privacy concerns and ethics of student data collection are not com-
mon topics of the discussions for the papers from both fields, although
is focused on practice, and EDM is focused on methodology.
they are critical issues for the field of big data research as a whole. Dis-
Both LA and EDM focus on students in a learning environment or
cussions on privacy and ethical concerns should address topics pertaining
during activities through looking at feedback, assessment, and outcomes.
to transparent data collection, management protocols for secure data
Investigating outcomes and feedback of students and teachers in different
storage process, ownership of data, and system in place to receive
environments is crucial to both fields. When and how to deliver feedback
learners’ consent (Leitner et al., 2019; Ferguson et al., 2016). According
is an intricate decision that requires considerable effort (Leitner et al.,
to our analysis of the LA and EDM articles in the recent five-year period of
2019). Also, teachers and other stakeholders might have ethical and
2015–2019, topics on privacy and ethics, and generalizability and indi-
privacy concerns over LA and EDM as student data are collected and thus
vidualized learning have not yet been made into the core of the discourse.
feel hesitant towards LA and EDM deployment into learning settings
This is especially striking given the current reckoning of all fields that fall

9
D.J. Lemay et al. Computers and Education: Artificial Intelligence 2 (2021) 100016

under the general designation of artificial intelligence. As the ability of study is further limited by its focus on EDM and LA research published in
private and public entities to accumulate mass amounts of information the last five years only. Additionally, our focus on abstracts rather than
has grown, so have concerns about the reach of such unregulated use of the full-text analysis means much data was left unexplored. However, we
personal data. Data mining techniques can be exploited by powerful feel justified as abstracts are meant to summarize the important infor-
statistical learning algorithms to uncover hidden patterns in behavior. mation and ensure a modicum of uniformity across disciplines.
This “artificial” intelligence demonstrably has very real repercussions for
good and ill. These algorithms can be biased and can serve to perpetuate 5.3. Future directions
inequalities rather than serve the best interests of students and teachers.
As such algorithms are dependent on vast amounts of stored data, and A recommendation for the two fields is to increase discussions on
can be trained to learn tasks that were previously the domain of hu- privacy concerns and ethics of data collection and analysis, and protocols
manity, from visual recognition and language understanding, large-scale to build appropriate procedures to maintain data and protect student
data mining proffers an advantage to those capable of marshalling the privacy. Also, future studies should increase discussions on universal
greatest computational resources. This unequal access tends to favor the system design for generalizability and meeting individual student needs.
few at the expense of the many. Increased discussions on privacy and ethical concerns, and universal
system design for all learners can help to challenge the status quo, expose
5.1. Recommendations inequalities, and improve learning opportunities for learners in disad-
vantaged groups (Ferguson, 2019).
We suggest that the fields of LA and EDM ought to explore disci-
plinary blinds spots such as big data and AI ethics (Calvet Li~nan and Juan Funding
Perez, 2015) rather than focus on their differences. As well, they ought to
focus on theory and knowledge building rather than the exploratory and No funding to report.
non-programmatic research approaches that have dominated the nascent
fields (Ihantola et al., 2015). Moreover, the research must demonstrate Compliance with ethical standards
applicability to educational practice and explore means to make research
findings and tools actionable for practitioners. For example, there is ev- Yes.
idence that visual learning analytics (Vieira et al., 2018) are amenable to
classroom practice as they are more easily understood by teachers and Declaration of competing interest
students and readily applicable to the teaching and learning situations.
None.
5.2. Limitations

As a retrospective study, no causal interpretations can be made. Our

Appendix A

Results for Learning Analytics 2015.

Results for Educational Data Mining 2015.

Results for Learning Analytics 2016.

Results for Educational Data Mining 2016.

Results for Learning Analytics 2017.

Results for Educational Data Mining 2017.

Results for Learning Analytics 2018.

Results for Educational Data Mining 2018.

Results for Learning Analytics 2019.

Results for Educational Data Mining 2019.
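The two-step keyword analysis described in Section 4.3 (overall keyword frequencies, then per-year frequencies of selected terms) can be sketched as follows. The study itself used the tidytext and quanteda packages in R; the Python version below is an illustrative stand-in, and the sample records, function names, and tokenization rule are all assumptions rather than the authors' code or data.

```python
import re
from collections import Counter

# Hypothetical records: (publication year, author keyword string).
articles = [
    (2017, "learning analytics; online learning; social network analysis"),
    (2018, "learning analytics; online discussion; collaborative learning"),
    (2018, "educational data mining; online courses"),
    (2019, ""),  # article without keywords
]


def tokenize(keyword_string):
    """Split a keyword string into lowercase single-word tokens."""
    return re.findall(r"[a-z]+", keyword_string.lower())


def with_keywords(records):
    """Exclude articles that list no keywords, as Section 4.3 describes."""
    return [(year, kw) for year, kw in records if kw.strip()]


def keyword_frequencies(records):
    """Step 1: overall frequency of each keyword token."""
    counts = Counter()
    for _, kw in with_keywords(records):
        counts.update(tokenize(kw))
    return counts


def term_trend(records, terms):
    """Step 2: raw frequency of each chosen term per publication year."""
    trend = {term: Counter() for term in terms}
    for year, kw in with_keywords(records):
        tokens = Counter(tokenize(kw))
        for term in terms:
            if tokens[term]:
                trend[term][year] += tokens[term]
    return trend


overall = keyword_frequencies(articles)
trend = term_trend(articles, ["online", "social"])
```

The sketch counts raw frequencies rather than proportions, mirroring the choice explained in Section 4.3; overall.most_common(14) would give a ranking analogous in form to Table 5.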

References

Aldowah, H., Al-Samarraie, H., & Fauzy, W. M. (2019). Educational data mining and learning analytics for 21st century higher education: a review and synthesis. Telematics Inf., 37, 13–49.
Baek, C., & Doleck, T. (2021). Educational Data Mining versus Learning Analytics: A Review of Publications from 2015-2019. Submitted for publication.
Baek, C., & Doleck, T. (2020). A bibliometric analysis of the papers published in the journal of artificial intelligence in education from 2015-2019. International Journal of Learning Analytics and Artificial Intelligence for Education, 2(1), 67–84. https://doi.org/10.3991/ijai.v2i1.14481
Baeten, M., Kyndt, E., Struyven, K., & Dochy, F. (2010). Using student-centred learning environments to stimulate deep approaches to learning: factors encouraging or discouraging their effectiveness. Educ. Res. Rev., 5(3), 243–260. https://doi.org/10.1016/j.edurev.2010.06.001
Baker, R. S. (2019). Challenges for the future of educational data mining: the Baker learning analytics prizes. JEDM | Journal of Educational Data Mining, 11(1), 1–17.
Baker, R. S., & Inventado, P. S. (2014). Educational data mining and learning analytics. In Learning Analytics (pp. 61–75). New York: Springer. https://doi.org/10.1007/978-1-4614-3305-7_4
Calvet Liñán, L., & Juan Pérez, Á. A. (2015a). Educational data mining and learning analytics: differences, similarities, and time evolution. RUSC. Universities and Knowledge Society Journal, 12(3), 98. https://doi.org/10.7238/rusc.v12i3.2515
Calvet Liñán, L., & Juan Pérez, Á. A. (2015b). Educational data mining and learning analytics: differences, similarities, and time evolution. International Journal of Educational Technology in Higher Education, 12(3), 98–112.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. (2009). Reading tea leaves: how humans interpret topic models. Adv. Neural Inf. Process. Syst., 22, 288–296.
Chen, X., Zou, D., Cheng, G., & Xie, H. (2020a). Detecting latent topics and trends in educational technologies over four decades using structural topic modeling: a retrospective of all volumes of computer & education. Comput. Educ., 103855.
Chen, X., Zou, D., & Xie, H. (2020b). Fifty years of British Journal of Educational Technology: a topic modeling based bibliometric perspective. Br. J. Educ. Technol., 51(3), 692–708.
Dormezil, S., Khoshgoftaar, T., & Robinson-Bryant, F. (2020). Differentiating between educational data mining and learning analytics: a bibliometric approach. CEUR Workshop Proceedings, 2592(May), 17–22.
ElSayed, A. A., Caeiro-Rodríguez, M., MikicFonte, F. A., & Llamas-Nistal, M. (2019). Research in learning analytics and educational data mining to measure self-regulated learning: a systematic review. In World Conference on Mobile and Contextual Learning (pp. 46–53).
Ferguson, R. (2019). Ethical challenges for learning analytics. Journal of Learning Analytics, 6(3), 25–30.
Ferguson, R., Hoel, T., Scheffel, M., & Drachsler, H. (2016). Guest editorial: ethics and privacy in learning analytics. Journal of Learning Analytics, 3(1), 5–15. https://doi.org/10.18608/jla.2016.31.2
Fu, Q. K., & Hwang, G. J. (2018). Trends in mobile technology-supported collaborative learning: a systematic review of journal publications from 2007 to 2016. Comput. Educ., 119(July 2017), 129–143. https://doi.org/10.1016/j.compedu.2018.01.00
Hutt, S., Gardner, M., Duckworth, A. L., & D’Mello, S. K. (2019). Evaluating fairness and generalizability in models predicting on-time graduation from college applications. International Educational Data Mining Society. http://files.eric.ed.gov/fulltext/ED599210.pdf
Ihantola, P., Rivers, K., Rubio, M. A., Sheard, J., Skupas, B., Spacco, J., Szabo, C., Toll, D., Vihavainen, A., Ahadi, A., Butler, M., Börstler, J., Edwards, S. H., Isohanni, E., Korhonen, A., & Petersen, A. (2015). Educational data mining and learning analytics in programming. Proceedings of the 2015 ITiCSE on Working Group Reports - ITICSE-WGR ’15, 41–63. https://doi.org/10.1145/2858796.2858798
Leitner, P., Ebner, M., & Ebner, M. (2019). Learning analytics challenges to overcome in higher education institutions. In D. Ifenthaler, D.-K. Mah, & J. Y.-K. Yau (Eds.), Utilizing Learning Analytics to Support Study Success (pp. 91–104). Springer International Publishing.
Leontev, A. N. (1981). Problems of the development of the mind. In Progress.
Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262–272).
Pandur, M. B., Dobsa, J., & Kronegger, L. (2020). Topic modelling in social sciences: case study of web of science. In Central European Conference on Intelligent and Information Systems.
Papamitsiou, Z., & Economides, A. A. (2014). Learning analytics and educational data mining in practice: a systematic literature review of empirical evidence. Journal of Educational Technology & Society, 17(4), 49–64.
Prieto, L. P., Sharma, K., Kidzinski, Ł., Rodríguez-Triana, M. J., & Dillenbourg, P. (2018). Multimodal teaching analytics: automated extraction of orchestration graphs from wearable sensor data. J. Comput. Assist. Learn., 34(2), 193–203.
Prosser, M., & Sze, D. (2014). Problem-based learning: student learning experiences and outcomes. Clin. Linguist. Phon., 28(1–2), 131–142. https://doi.org/10.3109/02699206.2013.820351
Roberts, M. E., Stewart, B. M., Tingley, D., & Others. (2014). stm: R package for structural topic models. J. Stat. Software, 10(2), 1–40.
Romero, C., & Ventura, S. (2020). Educational data mining and learning analytics: an updated survey. WIREs Data Mining and Knowledge Discovery, 10(3). https://doi.org/10.1002/widm.1355
Şahin, M., & Yurdugül, H. (2019). Educational data mining and learning analytics: past, present and future [Eğitsel veri madenciliği ve öğrenme analitikleri: dünü, bugünü ve geleceği]. Bartin Univ. J. Fac. Educ., 9(1), 121–131. https://doi.org/10.14686/buefad.606077
Sergis, S., & Sampson, D. G. (2017). Teaching and learning analytics to support teacher inquiry: a systematic literature review. In A. Peña-Ayala (Ed.), Learning Analytics: Fundaments, Applications, and Trends: A View of the Current State of the Art to Enhance E-Learning (pp. 25–63). Springer International Publishing.
Siemens, G., & Baker, R. S. J.d. (2012). Learning analytics and educational data mining. Proceedings of the 2nd International Conference on Learning Analytics and Knowledge - LAK ’12, 252. https://doi.org/10.1145/2330601.2330661
Steiner, C. M., Kickmeier-Rust, M. D., & Albert, D. (2014). Learning analytics and educational data mining: an overview of recent techniques. Learning Analytics for and in Serious Games, 6, 61–75.
Taddy, M. (2012). On estimation and selection for topic models. Artificial Intelligence and Statistics, 1184–1193.
Tsai, Y., Poquet, O., Gasevic, D., Dawson, S., & Pardo, A. (2019). Complexity leadership in learning analytics: drivers, challenges and opportunities. Br. J. Educ. Technol.: Journal of the Council for Educational Technology, 50(6), 2839–2854.
Vatrapu, R., Teplovs, C., Fujita, N., & Bull, S. (2011). Towards visual analytics for teachers’ dynamic diagnostic pedagogical decision-making. Proceedings of the 1st International Conference on Learning Analytics and Knowledge, 93–98.
Vermunt, J. D., & Donche, V. (2017). A learning patterns perspective on student learning in higher education: state of the art and moving forward. Educ. Psychol. Rev., 1–31. https://doi.org/10.1007/s10648-017-9414-6
Vescio, V., Ross, D., & Adams, A. (2008). A review of research on the impact of professional learning communities on teaching practice and student learning. Teach. Teach. Educ., 24(1), 80–91. https://doi.org/10.1016/j.tate.2007.01.004
Vieira, C., Parsons, P., & Byrd, V. (2018). Visual learning analytics of educational data: a systematic literature review and research agenda. Comput. Educ., 122, 119–135. https://doi.org/10.1016/j.compedu.2018.03.018
Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1105–1112). Association for Computing Machinery.
Wenger, E. (1998). Communities of practice: learning, meaning, and identity. In Learning in Doing. Cambridge University Press.
Wise, A. F., & Jung, Y. (2019). Teaching with analytics: towards a situated model of instructional decision-making. Journal of Learning Analytics, 6(2), 53–69.
Xia, L., & Zhong, B. (2018). A systematic review on teaching and learning robotics content knowledge in K-12. Comput. Educ., 127(122), 267–282. https://doi.org/10.1016/j.compedu.2018.09.007
Xie, H., Chu, H. C., Hwang, G. J., & Wang, C. C. (2019). Trends and development in technology-enhanced adaptive/personalized learning: a systematic review of journal publications from 2007 to 2017. Comput. Educ., 140(July 2018), 103599. https://doi.org/10.1016/j.compedu.2019.103599
Xing, W., Guo, R., Petakovic, E., & Goggins, S. (2015). Participation-based student final performance prediction model through interpretable Genetic Programming: integrating learning analytics, educational data mining and theory. Comput. Hum. Behav., 47, 168–181. https://doi.org/10.1016/j.chb.2014.09.034
