Early Prediction of Student Performance

using Neural Networks


An exploratory study on the efficacy of a neural network model
at predicting students’ final grades in a flipped classroom environment.

Thesis report submitted in partial fulfillment of the requirements


for the degree of
Mathematical Physics, BSc.

by
Derrick Bailey
Under the supervision of
Dr. Kenneth C. Walsh

Department of Physics
Oregon State University
May 25, 2022
Certificate

This is to certify that the work contained in this thesis entitled “Early Prediction of Stu-
dent Performance using Neural Networks" is a bonafide work of Derrick Bailey, carried
out in the Department of Physics, Oregon State University under my supervision and that it has
not been submitted elsewhere for a degree.

Corvallis, OR
May 25, 2022

Dr. Kenneth C. Walsh


Department of Physics
Oregon State University
Acknowledgments

I express my deepest gratitude towards my advisor, mentor, and friend, Dr. Kenneth C.
Walsh, director of Project Boxsand, Department of Physics, Oregon State University,
for the constant help and encouragement from the very beginning of this project. I have been
fortunate to have such a person guide me in this project, having given me the freedom to explore
possibilities and construct my own vision. Thank you for working with me through tumultuous
times and always being available for conversation.
I want to thank my mother for always pushing me to be better and ensuring I was always
in a position to make my own choices and do great things.
A special person who believed in me and sparked my interest in science is a Dr. Joseph
Armin, my 5th grade elementary teacher. He inspired me to be a thinker and inadvertently
unlocked this path for me. Forever will I be grateful and fond.
My relationship with mathematics would not be possible had it not been for Jennifer
Stutzer, my Algebra II, Pre-Calculus & Trigonometry, and Calculus I & II teacher in high
school. Mrs. Stutzer is single-handedly responsible for helping me find a passion for mathe-
matical problem-solving and changed my view towards mathematics from disgust and fear to
intrigue and power. I would not be in this position today if it were not for her. An excellent
role model and instructor who I learned much from,
Additional thanks is warranted to Robert Peter Miller and Kameron Ransom, friends
who were always willing to listen and offer advice.
Lastly, I want to acknowledge the contribution of various Society of Physics Students (SPS)
members at Oregon State University. SPS was a fantastic environment filled with wonderful
people that I’m grateful to have been a part of. Special thanks to:
Basie Seibert, President
Ian Diaz, Vice President
Maxwell Siebersma, Treasurer
Alexa Zaback, Secretary
and all other members whom I interacted regularly with and found solid friendship and
advice: Christopher Gamboa, José Medina-Hernandez, Ethan Steele, Christopher
Magone, Adam Levenburg, Leah Holmes, Magnus L’Arguent, Dustin Treece, Abbie
Glickman, Thomas Knudson, Tyler Norgen, Logan Holler, Jaden Downing.

Corvallis, OR
May 25, 2022

Derrick Bailey
Contents

List of Figures

Abstract

1 Introduction
1.1 Background and Related Work
1.1.1 Asking the Right Questions
1.1.2 Immediately Preceding Work

2 Computational Methodology
2.1 Methodological Theory
2.1.1 Previous Methods
2.1.2 Neural Networks
2.1.3 Gated Recurrent Unit

3 Study Parameters
3.1 Motivation
3.2 Physics Cohort
3.3 Data Breakdown & Parsing
3.4 Exploratory Work
3.5 Hyperparameter Tuning
3.6 Feature Correlation

4 Results, Discussion & Comparison
4.1 Week 11 Predictions
4.2 Week 4 Predictions

5 Conclusion & Further Work

Appendix

Bibliography
List of Figures

1.1 Relative Feature Importance and Mean Absolute Error Plot of Student Grade Predictors

2.1 Random Forest Decision Tree Example
2.2 Simple Neural Network
2.3 Simple Neural Network: Deeper Overview
2.4 Single-Layer Feedforward Artificial Neural Network
2.5 Single GRU Layer
2.6 GRU Update Gate
2.7 GRU Reset Gate
2.8 GRU Current Memory Gate
2.9 GRU Output Gate

3.1 2 Layer LSTM
3.2 2 Layer LSTM with Double Dropout
3.3 Feature Correlation Heatmap

4.1 F2019 Week 11 Unsegregated
4.2 F2019 Week 11 Male-Identifying Cohort
4.3 F2019 Week 11 Female-Identifying Cohort
4.4 F2019 Week 11 1st Generation Cohort
4.5 F2019 Week 11 Non-1st Generation Cohort
4.6 F2019 Week 11 White Cohort
4.7 F2019 Week 11 Non-White Cohort
4.8 F2019 Week 4 Unsegregated 1st Attempt
4.9 F2019 Week 4 Unsegregated Refined Hyperparameters
4.10 F2019 Week 4 Unsegregated Further Refined Hyperparameters 1
4.11 F2019 Week 4 Unsegregated Further Refined Hyperparameters 2
4.12 F2019 Week 4 Unsegregated Cohort
4.13 F2019 Week 4 Male Cohort
4.14 F2019 Week 4 Female Cohort
4.15 F2019 Week 4 1st Gen Cohort
4.16 F2019 Week 4 Non-1st Gen Cohort
4.17 F2019 Week 4 White Cohort
4.18 F2019 Week 4 Non-White Cohort

5.1 8 Layer LSTM
5.2 2 Layer LSTM with External Dropout, Partial Dataset
5.3 2 Layer LSTM Double Dropout, Partial Dataset
5.4 2 Layer LSTM Double Dropout, 500 Epoch, Partial Dataset
5.5 2 Layer LSTM Double Dropout, Full Dataset
5.6 2 Layer LSTM Double Dropout, Units = 2x Features
5.7 2 Layer LSTM Double Dropout, Units = 3x Features
5.8 2 Layer LSTM Double Dropout, Units = 48
5.9 4 Layer LSTM, Randomized Hyperparameters 1
5.10 4 Layer LSTM, Randomized Hyperparameters 2
5.11 Week 4 Hyperparameter Tuning 1
5.12 Week 4 Hyperparameter Tuning 2
5.13 Week 4 Hyperparameter Tuning 3
5.14 Week 4 Hyperparameter Tuning 4
5.15 Week 4 Hyperparameter Tuning 5
5.16 Week 4 Hyperparameter Tuning 6
5.17 Week 4 Hyperparameter Tuning 7
5.18 Week 4 Hyperparameter Tuning 8
5.19 Week 4 Hyperparameter Tuning 9
5.20 Week 4 Hyperparameter Tuning 10
5.21 Week 4 Hyperparameter Tuning 11
5.22 Week 4 Hyperparameter Tuning, No Exam Solutions Accessed, No Lab, Doubled Epochs
5.23 Week 4 Hyperparameter Tuning, No Exam Solutions Accessed, No Lab
Abstract

With the unique conceptual and mathematical backgrounds of every introductory physics
student, there is no way to design a rigid curriculum that anticipates and meets every student's
needs. We can employ machine learning to explore the correlations that exist in the voluminous
data generated in educational settings: institutional, gradebook, and clickstream data (records of a
student's interactions with the course, e.g., total syllabus views). It is then possible not only to predict a
student's course grade before the first exam of the term, but also to expose the factors that most
heavily influence that prediction. Through the weighting of individual features, both numerical,
such as grade-point average (GPA) and pre-class quiz scores, and binary, such as
gender and previous physics experience, deep learning can uncover patterns in the data previously
unseen and deliver accurate predictions with supporting evidence.
A Gated Recurrent Unit (GRU) algorithm was employed with great success on predictions
from complete term data (aggregated from weeks zero through eleven). Mean absolute error
(MAE) and mean squared error (MSE) values were reduced to less than 3 and 14, respectively,
for training and predictions performed on the unsegregated cohort. Demographic breakdowns
proved mostly promising, with respectable MAE and MSE hovering around 5 and 30,
respectively, when averaged. The main issue with this model is its poor performance with low numbers
of students. For example, in the Fall 2019 cohort there were only 25 students
who identified as 1st Generation, so the model had difficulty finding patterns in the data,
leading to weaker predictions.
Chapter 1

Introduction

Traditionally in academia, there is a single overarching method of instruction coupled with
an expected level of understanding (i.e., course prerequisites). This staple of higher education
is created with the presupposition that each student enters a course with the required schema
to understand the material presented. This is a blanket, and often incorrect, assumption. The
question becomes: how is it possible to ensure every student has access to the resources and
materials necessary to succeed? Each student is unique, therefore, it must be possible to tailor
intervention specifically to groups of students or even the individual. The key to this is educa-
tional data. The amount of data that a conventional university holds on students is enormous:
institutional data alone (overall GPA, race, sex, credits, etc.) could provide substantial insight
into patterns of instruction and student performance. Coupling institutional
data with data generated from student-course interaction amounts to a mountain of untapped
information.
Machine learning may be the key to finding new and potentially unexpected correlations in
student performance: it is a tool that vastly improves efficiency in analyzing data and unlocks
new possibilities in understanding large data sets focused on human individuals, often in ways not
immediately obvious or even deducible by traditional data analysis. Employing a neural network
to analyze institutional and academic term-generated student performance data (clickstream
and gradebook data) allows for insights that are both not readily available and traditionally
missing. By comparing student metrics to gradebook data, it is possible to form a model
that predicts a student's grade. Incorporating a large feature set, including, but not limited
to, institutional GPA, homework/quiz/exam scores, homework/quiz subset scores (a problem-by-problem
breakdown associated with learning objectives), engagement with material, and engagement
with other students, allows for the possibility of training a neural network to predict both an
individual student's grade and that same student's knowledge gaps. If, given multiple sets of
combined data (e.g., fall, winter, and spring terms), correlations between individual features and
overall course grade can be identified, it may be possible to offer intervention to students
before the first midterm of the quarter.

1.1 Background and Related Work

1.1.1 Asking the Right Questions

In an effort to understand student learning in the flipped classroom environment, one could
simply ask the instructor and receive a fairly accurate overview of student performance trends:
traditionally, what students do well on, what concepts come across to the majority of students,
and so forth, along with their opposites. This human insight is theoretically replicable, given
enough raw data and compute time. Within these data exist features: classic indicators like GPA,
homework scores, and credits earned, but also previously unquantified ones like aggregate
syllabus clicks and homework solution interaction.
whether or not neural networks are a worthwhile investment over random forest regression at
predicting students' final grades. This was explored in two phases. First, the model was trained
on an end-of-term (week 11) dataset representing the Fall 2019 algebra-based introductory
physics cohort, and its predictions were examined to determine whether the neural network
could predict final grades at all. Second, the same model was trained on a truncated version of
the dataset covering the same cohort between weeks zero and four, crucially before the first
exam occurs.

1.1.2 Immediately Preceding Work

In a 2020 undergraduate thesis on student grade prediction by Michael B. Mauer, advised
by Kenneth C. Walsh, random forest regression models were used to study the importance of
features throughout each week of a ten-week term [13]. Figure 1.1 is an example of this relative
feature importance.

Figure 1.1: Relative Feature Importance Plot of Student Grade Predictors and Mean Absolute
Error Plot Comparing Predicted on Actual Grades, Week 4 of 10, [13].

Figure 1.1 also shows the mean absolute error (MAE) of the grade prediction by their random
forest regression model. MAE is the average of the aggregate absolute error, calculated as:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

where yi is the actual output value and ŷi is the predicted output value. In this case, the
absolute difference between a students’ actual grade and their predicted grade. MAE is the
defining metric of this study to determine predictive power.
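As a concrete illustration of the two metrics used throughout this study, both can be computed in a few lines of Python. The grade arrays below are hypothetical placeholders, not values from any dataset discussed here.

    import numpy as np

    # Hypothetical actual and predicted final grades (in percent), for illustration only.
    actual = np.array([92.0, 78.5, 64.0, 85.0])
    predicted = np.array([89.0, 80.0, 70.5, 84.0])

    mae = np.mean(np.abs(actual - predicted))   # mean absolute error
    mse = np.mean((actual - predicted) ** 2)    # mean squared error
    print(f"MAE = {mae:.3f}, MSE = {mse:.3f}")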
Notably, by week 4 of the term, Mauer and Walsh found that a student’s academic GPA had
the largest influence on their course grade. By week 5, after the first exam, the relative feature
importance of the first exam is similar to that of GPA. This is expected as each exam is worth
15% of a student's final grade. What is lacking, however, in these predictions is the granularity
required to examine a student’s course performance when non-clickstream heavyweight features
are selectively removed. Clearly, a student’s GPA is a fairly good indicator of expected course
performance, but it does not tell the whole story. By determining the relative feature importance
of an entire term’s worth of data on a week-by-week basis, the progression of relative feature
importance can be studied (e.g., in week 2, a student’s GPA is the biggest indicator of expected
performance, but by week 5, the first midterm will be an equally large indicator). This feature
importance “weighting” is then used in a decision tree to predict a particular student's grade.
By the end of the term, the data from Mauer and Walsh clearly show that exam grades, followed
closely by weekly homework grades, are the biggest features indicative of a student's performance
in the course.
In a 2019 study by John Stewart et al., machine learning with logistic and random forest
regressions was also used to study student performance in physics courses [15]. However, the feature
sets and methodology differ from Mauer and Walsh. Notably, in Stewart, the classification was
binary: students who received an A or a B were coded as a “1” and students otherwise were
coded as a “0”, whereas in Mauer and Walsh, the full range of grade percentage was used.
In Stewart et al., Table V depicts the performance parameters for the default random forest
regressor package in the programming language R, denoted "Default", and then tuned, denoted
"Overall", by adjusting the decision threshold in Stewart et al.'s Supplementary Material [37].
Table V indicates that, with a sample size of N = 1683, the institutional and in-class week-by-week
accuracies varied by only 0.03 ± 0.01. However, the DFW accuracy, for students who receive a D,
F, or W (withdraw), ranged from 0.53 ± 0.05 in Week 1 to 0.68 ± 0.05. The regression model
employed by Stewart et al. had high accuracy in predicting students who were not at risk, but
struggled to predict those at risk [15].

Chapter 2

Computational Methodology

2.1 Methodological Theory

2.1.1 Previous Methods

Techniques such as k-means clustering [4] or random forest decision trees [13] [15] have been
used in educational research. These approaches, while proven with varying degrees
of success, do not satisfactorily address the “uniqueness” problem – how to tailor these algorithms
to specific students with great accuracy in prediction.
K-means clustering is a powerful tool for forming group classifications, but it lacks the depth of detail
required. For instance, k-means clustering can suggest relationships within a specific data set
– e.g., it could group types of students from a specific term. It would be
possible to group students based on peer-learning responses as an example of near-quartered
distribution; it would also be possible to group students on a pass/fail basis as an example of
nonlinear clustering. However, due to the nature of the algorithm, this ultimately does not
provide the level of depth required: k-means clustering could not accurately determine, to within
a few percentage points, a student's overall grade in the course, particularly on data it was not trained on.
Random forest is a collection of techniques and algorithms used to manipulate data, often
for prediction, regression, or classification purposes. In order to understand random forests, it
is necessary to understand decision trees. Decision trees are algorithms which split the dataset
into homogeneous groups based on the given measured variables and then measure to what
degree these splits have created subsets that are maximally homogeneous. The ideal goal is for
the algorithm to continue splitting and categorizing the data until each subgroup is perfectly
homogeneous – however, this often results in overfitting: a severe problem in data analysis which
results in a given method corresponding too closely to a particular set of data, making it more
likely to fail at its task when given similar but not identical data. To eliminate overfitting,
often a limit is instituted on the decision tree which balances complexity with predictive power.
Random forests, as an extension of decision tree algorithms, differ in that, instead of a single
tree for a model, thousands of trees are grown. Figure 2.1 depicts the general structure of a
random forest decision tree.

Figure 2.1: Random Forest Decision Tree Example, by Mathieu Guillame-Bert, Sebastian Bruch,
Josh Gordon, and Jan Pfeifer, TensorFlow Blog.

Random forests are a powerful tool, and are often the less computationally expensive and
easier-to-implement solution for many machine learning problems.
determine whether extensive tuning of a neural network could beat the predictive power that
lies with a properly constructed random forest algorithm in the case of student grades, given an
identical dataset. Generally, random forest algorithms can predict accurately with less data –
however, with large sets of data, it is possible that a neural network could be trained to recognize
patterns in smaller sets of data, potentially out-performing a random forest algorithm [13] [5].
Additionally, with smaller sets of data, as typically seen in course-specific datasets, the question
becomes how neural networks can compare to random forests.
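For reference, a random forest baseline of the kind used in the preceding work can be assembled with a few lines of scikit-learn. The feature matrix X and grade vector y below are randomly generated placeholders, and the hyperparameter values are illustrative; they are not the settings used by Mauer and Walsh.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Placeholder data: 300 students, 20 features (GPA, homework scores, clicks, ...).
    rng = np.random.default_rng(0)
    X = rng.random((300, 20))
    y = 100 * rng.random(300)  # final grades in percent

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Many trees of limited depth; limiting depth helps control overfitting.
    forest = RandomForestRegressor(n_estimators=1000, max_depth=8, random_state=0)
    forest.fit(X_train, y_train)

    print("MAE:", mean_absolute_error(y_test, forest.predict(X_test)))
    print("Most important features:", np.argsort(forest.feature_importances_)[::-1][:5])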

2.1.2 Neural Networks

Neural networks are, loosely speaking, an attempt to replicate the structure and function of
the human brain in silicon. Neural networks consist of many simple but highly interconnected
units, called “neurons”, organized into discrete layers that process information using static
or dynamic state responses to inputs.

Figure 2.2: Simple neural network, by Rajat Gupta, Medium.com.

Neural networks operate by pattern recognition – recognizing patterns that are far too complex to be
manually extracted. A series of inputs is fed through hidden layers (so called because they are
neither input nor output layers), where weights and biases are applied; the result is then passed
through a pre-set activation function which determines whether that specific neuron “fires”, i.e.,
is activated and passed along. These computations are then compiled into one or more output
neurons. A single-layer neural network has the following computational flow:

Figure 2.3: Simple neural network with 2 inputs, 1 hidden layer, and 2 outputs, including biases
and weights. By Matt Mazur, Medium.com.

where i_n are the inputs, h_n are the hidden nodes, and o_n are the outputs; b_n represent
biases, either externally given or recursively computed, and w_n are the weights
applied between every computation, viewed as transitioning between layers.
Into these hidden layers are fed weight and bias matrices. Weights come from the input, and
biases are initially externally applied. For unsupervised learning, these matrices are initially
entirely zero and have a square shape of the size of the input. The input depends on the logical
relationship in the data. In recurrent neural networks (RNNs), there are four main logical
relationships: many-to-one, one-to-many, many-to-many, and one-to-one. For simplicity and
relevance, we will examine the many-to-one approach. There exists a sequence of data as the
input, which is then mapped to a single output. To explain the general idea, consider this
example: five students can work on a single project for class. How hard the students work on
their sections of the project contributes to the overall grade the project receives. In this example,
we have 5 inputs (the students) and a single output (the project grade) with a number of hidden
layers between the two (the effort put in by each student).
(Recurrent neural networks are derived from feedforward neural networks and form a temporal
sequence among nodes, which allows for temporal dynamic behavior.)

Figure 2.4: A single-layer feedforward artificial neural network. Arrows originating from x_2 are
omitted for clarity. There are p inputs and q outputs. In this system, the value of the q-th
output, y_q, would be calculated as y_q = K\left(\sum_{i=1}^{p} x_i w_{iq} - b_q\right).
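A minimal NumPy sketch of the forward pass described in Figure 2.4, assuming the scaling constant K = 1 and arbitrary example sizes for p and q; the array names are illustrative only.

    import numpy as np

    p, q = 4, 2                   # numbers of inputs and outputs (illustrative)
    rng = np.random.default_rng(0)
    x = rng.random(p)             # input vector x_i
    W = rng.random((p, q))        # weights w_iq
    b = rng.random(q)             # biases b_q

    # y_q = K * (sum_i x_i * w_iq - b_q), with K = 1 here
    y = x @ W - b
    print(y)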

More specifically, each hidden layer has an activation function. The activation function
defines the output of a neuron given a set of inputs. One of the most common activation
functions is the Rectified Linear Unit (ReLU) function:

f (x) = max(0, x)

where x is the input to a neuron. ReLU is easy to implement and very quick to compute the
gradient for, given its linear nature and true zero value; however, it cannot be differentiated
at x = 0 and its gradient is zero for negative inputs. This results in the “dying ReLU problem”:
there exists no gradient when the output of a neuron is zero, and if a large enough portion of neurons
zero out, this can lead to poor performance of the model. Additionally, in our case, ReLU is
not appropriate for recurrent neural network applications, specifically due to this vanishing-gradient
behavior. For our purposes, we will employ the default activation and recurrent activation
functions for a Gated Recurrent Unit (GRU), discussed in the next section.

2.1.3 Gated Recurrent Unit

A relatively new implementation in the space of recurrent neural networks is the Gated Recurrent
Unit (GRU). Introduced in 2014 by Kyunghyun Cho et al., GRUs aim to solve the
vanishing gradient problem common to recurrent neural networks [8]. GRUs are considered a
variation on the Long Short-Term Memory (LSTM) architecture, a common architecture
used in machine learning for temporal prediction. Gated Recurrent Units have also been
shown to perform better on smaller and/or less frequent datasets [8] [9]. To explain GRU
architecture, we will pull from the excellently written “Understanding GRU Networks” by Simeon
Kostandinov (Understanding GRU Networks | Towards Data Science), written Dec. 16, 2017, on
towardsdatascience.com, and “A Tutorial on Backward Propagation Through Time (BPTT)
In The Gated Recurrent Unit (GRU) RNN” by Minchen Li, Department of Computer Science,
The University of British Columbia (BPTTTutorial.pdf (ucla.edu)).
To solve the vanishing gradient issue, GRUs utilize two unique gates: an update gate and a
reset gate. In every layer, there is an activation function and a recurrent activation function:
hyperbolic tangent and sigmoid, respectively. The hyperbolic tangent function is:

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

The sigmoid function is:

f(x) = \frac{1}{1 + e^{-x}}

These functions are important in the architecture of a single GRU layer, depicted in Figure
2.5.

Figure 2.5: An overview of a single GRU layer, by Simeon Kostandinov [12].

Update Gate

The update gate helps the model to determine how much past information to utilize in future
calculations.

Figure 2.6: Highlight of the update gate in a GRU, by Simeon Kostandinov [12].

The update gate for time step t, z_t, is calculated as follows:

z_t = \sigma(U_z x_t + W_z s_{t-1})

When an input x_t is delivered to a unit, it is multiplied by its corresponding weight U_z and
summed with the information held from the previous time step, denoted s_{t-1}, which is multiplied by
its own weight W_z. A sigmoid activation function, \sigma, is applied to constrain the result between
0 and 1.

Reset Gate

The reset gate determines how much past information the model should forget. The reset
gate is calculated as follows:

r_t = \sigma(U_r x_t + W_r s_{t-1})

Similar to the update gate, s_{t-1} and x_t are multiplied by their own weights, summed, and have
the sigmoid function applied.

Figure 2.7: Highlight of the reset gate in a GRU, by Simeon Kostandinov [12].

The reset gate stores the relevant information from the past time step into the new memory
content (discussed in the next section). It then multiplies the input vector and hidden states
with their weights and performs element-wise multiplication between the current reset gate and
the previous hidden state and aggregates the results. Applying an activation function to these
results produces h_t.

Current Memory

To utilize past information, we introduce a new memory content which uses the reset
gate to store the relevant information from previous time steps. It is calculated as follows:

h_t = \tanh(U_h x_t + W_h (s_{t-1} \odot r_t))

First, multiply the input x_t by its weight U_h. Calculate the Hadamard
product between the reset gate r_t and s_{t-1}, which determines what to remove from previous
time steps, and multiply it by the weight W_h. The non-linear tanh function is then applied to the
sum of these terms, as depicted in Figure 2.8.

Figure 2.8: Highlight of the current memory gate in a GRU, by Simeon Kostandinov [12].

Time Step Output

Finally, the network calculates s_t, a vector which holds information for the current unit and
passes it along the network to further layers. It is calculated as follows:

s_t = (1 - z_t) \odot h_t + z_t \odot s_{t-1}

s_t requires the update gate to determine what to collect from the current memory content.
Element-wise multiplication is applied to the update gate z_t and s_{t-1}, and again to (1 - z_t)
and h_t, and the two results are summed.

Figure 2.9: Highlight of the output gate in a GRU, by Simeon Kostandinov [12].
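Collecting the gate equations above into a single forward step gives the following plain NumPy sketch. It uses the same symbols as this section (U for input weights, W for recurrent weights, s for the carried state), omits biases to match the simplified equations, and uses arbitrary sizes; it is an illustration, not the model code used later in this thesis.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, s_prev, U_z, W_z, U_r, W_r, U_h, W_h):
        """One GRU time step: returns the new state s_t."""
        z_t = sigmoid(U_z @ x_t + W_z @ s_prev)           # update gate
        r_t = sigmoid(U_r @ x_t + W_r @ s_prev)           # reset gate
        h_t = np.tanh(U_h @ x_t + W_h @ (s_prev * r_t))   # current memory content
        s_t = (1 - z_t) * h_t + z_t * s_prev              # output state for this step
        return s_t

    n_in, n_units = 3, 5
    rng = np.random.default_rng(1)
    x_t = rng.standard_normal(n_in)
    s_prev = np.zeros(n_units)
    mats = [rng.standard_normal((n_units, n_in)) if i % 2 == 0
            else rng.standard_normal((n_units, n_units)) for i in range(6)]
    print(gru_step(x_t, s_prev, *mats))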

Backpropagation

A strategy used in neural network architecture is the idea of backpropagation – the ability of
a network to optimize the weights so that it may learn how to correctly map inputs to outputs,
minimizing error both for the individual output neuron and the network as a whole. In essence,
the model learns the relationships between the inputs and finds patterns in the data [16]. For
some time step t, the GRU unit computes the output \hat{y}_t using the input x_t and the previous internal
state s_{t-1} as follows:

z_t = \sigma(U_z x_t + W_z s_{t-1} + b_z)

r_t = \sigma(U_r x_t + W_r s_{t-1} + b_r)

h_t = \tanh(U_h x_t + W_h (s_{t-1} \odot r_t) + b_h)

s_t = (1 - z_t) \odot h_t + z_t \odot s_{t-1}

\hat{y}_t = f(V s_t + b_V), \quad V \in \mathbb{R}^{n_V \times n_i}, \; b_V \in \mathbb{R}^{n_V \times 1}

where f is the output activation function and any b_i, i = z, r, h, is a possible bias with respect to its gate. Firstly, backpropagation
relies on the gradient computation, which itself depends on the error computation:

E_t = -y_t \log(\hat{y}_t)

where y_t is the target value and \hat{y}_t is the t-th scalar value in the model output. The total error is
then

E = \frac{1}{N} \sum_{t=1}^{N} -y_t \log(\hat{y}_t)

To train the GRU network, we want to find the values of all parameters
\theta = [U_z, U_r, U_h, W_z, W_r, W_h, b_z, b_r, b_h, V, b_V]
that minimize the total error. By nature, this is a non-convex problem with massive
input data. The Stochastic Gradient Descent method is used to solve it, requiring that we calculate

\frac{\partial L}{\partial U_z}, \frac{\partial L}{\partial U_r}, \frac{\partial L}{\partial U_h}, \frac{\partial L}{\partial W_z}, \frac{\partial L}{\partial W_r}, \frac{\partial L}{\partial W_h}, \frac{\partial L}{\partial b_z}, \frac{\partial L}{\partial b_r}, \frac{\partial L}{\partial b_h}, \frac{\partial L}{\partial V}, \frac{\partial L}{\partial b_V}

for each input “batch”. To compute the gradient for a single input in a single layer, we sum the error over time steps t:

\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}

where

\frac{\partial E_t}{\partial W} = \frac{\partial E_t}{\partial s_t} \frac{\partial s_t}{\partial W}
= \sum_{i=1}^{t} \frac{\partial E_t}{\partial s_t} \frac{\partial s_t}{\partial s_i} \frac{\partial s_i}{\partial W}
= \sum_{i=1}^{t} \frac{\partial E_t}{\partial s_t} \left( \prod_{j=i}^{t-1} \frac{\partial s_{j+1}}{\partial s_j} \right) \frac{\partial s_i}{\partial W}

Similarly,

\frac{\partial s_t}{\partial s_{t-1}} = \frac{\partial s_t}{\partial h_t} \frac{\partial h_t}{\partial s_{t-1}} + \frac{\partial s_t}{\partial z_t} \frac{\partial z_t}{\partial s_{t-1}} + \frac{\partial s_t}{\partial s_{t-1}}
= \frac{\partial s_t}{\partial h_t} \left( \frac{\partial h_t}{\partial r_t} \frac{\partial r_t}{\partial s_{t-1}} + \frac{\partial h_t}{\partial s_{t-1}} \right) + \frac{\partial s_t}{\partial z_t} \frac{\partial z_t}{\partial s_{t-1}} + \frac{\partial s_t}{\partial s_{t-1}}

Intermediate results yield

\frac{\partial s_t}{\partial s_{t-1}} = (1 - z_t) \odot \Big[ W_r^{T}\big((W_h^{T}(1 - h_t \odot h_t)) \odot s_{t-1} \odot r_t \odot (1 - r_t)\big) + (W_h^{T}(1 - h_t \odot h_t)) \odot r_t \Big] + W_z^{T}\big((s_{t-1} - h_t) \odot z_t \odot (1 - z_t)\big) + z_t

Now, we have all necessary pieces to calculate the gradient of an error at a particular time.
These calculations repeat every epoch.
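To make the role of stochastic gradient descent concrete, the toy sketch below applies the same per-batch update rule, parameter minus learning rate times gradient, to a simple linear-regression problem. It is purely illustrative and unrelated to the thesis dataset or the GRU gradients derived above.

    import numpy as np

    # Toy regression problem solved by mini-batch stochastic gradient descent.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.3

    w, b, lr, batch_size = np.zeros(3), 0.0, 0.05, 20
    for epoch in range(100):                    # every epoch sees each sample exactly once
        for i in range(0, len(X), batch_size):  # one update per mini-batch
            xb, yb = X[i:i + batch_size], y[i:i + batch_size]
            err = xb @ w + b - yb               # prediction error on the batch
            w -= lr * (xb.T @ err) / len(xb)    # gradient step (constants folded into lr)
            b -= lr * err.mean()
    print(w, b)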

Chapter 3

Study Parameters

3.1 Motivation

What, exactly, comprises a student's grade? Obvious answers include showing up for class and
homework and exam scores. However, educational data mining has revealed that the factors
contributing to grades are not black and white [3] [15] [13] [6] [14] [4]. There is a layer - a deep layer
- of complexity intertwining every aspect of a course with a student's performance, quantifiable
and not. What, in particular, are the driving course-related factors of students' grades? For a
cohort, and for an individual? Is there a difference in the factors? For demographic groups?

3.2 Physics Cohort

The student cohort was pulled from an algebra-based introductory physics sequence, specifi-
cally the 2019 fall term (PH201). This first course focuses on introducing students to the basics
of kinematics, Newtonian mechanics, momentum, and energy. More information can be found
on the Boxsand website (PH20X General Info). The student demographic breakdown is as
follows:
Table 3.1: Demographic Breakdown

Total Students 347


Male 124
Female 223
1st Gen 79
Non-1st Gen 268
White 226
Non-White 121

There were more students enrolled in the PH201 course; however, students who dropped the
course, did not take an exam, or took an incomplete were not included in any calculations and
thus not reported in the demographics. Demographics were collected from institution-provided
data with IRB approval.

3.3 Data Breakdown & Parsing

Educational data were collected and can be grouped into three distinct categories: institutional,
gradebook, and clickstream, presented in Table 3.2.
Institutional refers to student metrics reported by the university, gradebook refers to
traditional classroom scores, and clickstream refers to any and all student interactions with the
Boxsand website. For example, if a student clicks on the syllabus, a YouTube video, and then a
practice exam, a log of each event is recorded with the student ID, a timestamp, and a link.
To generate the feature set in totality, a unique student ID list is first gathered from each
source and the lists are compared to one another. Students who were present in either institutional or
gradebook data but not clickstream data were deleted from the dataset. Each source is parsed
individually to collect student IDs and associated features. For institutional and gradebook data,
this is straightforward as each student ID is unique to feature entries, with the exception of pre-
and post-lecture assignments. Students are allowed to submit an unlimited number of times
to achieve a maximal score, and every submission is recorded. To parse this, submissions are
grouped by student ID and then further parsed for a maximum attempt and maximum score.
It was assumed that the maximum attempt would be directly related to the maximum score.
For clickstream data, a per-student summation is performed over each recorded feature; a brief
code sketch of this parsing follows Table 3.2.

Table 3.2: Feature List

Institutional          | Gradebook          | Clickstream
Overall OSU GPA        | Pre-Lecture Score  | # Midterm 1 Solutions Accessed
OSU GPA                | Post-Lecture Score | # Midterm 2 Solutions Accessed
OSU Credits Attempted  | Homework Score     | # Final Solutions Accessed
OSU Credits Earned     | Recitation Score   | # Practice Exams Accessed
Gender                 | Lab Grade          | # Calendar Accessed
1st Generation         | Midterm 1 Score    | # Fundamental Accessed
Regulatory Race        | Midterm 2 Score    | # Videos Accessed
                       | Final Grade        |
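A minimal pandas sketch of this parsing step is shown below. The column names (student_id, attempt, score, event) are hypothetical stand-ins for the actual export format.

    import pandas as pd

    # Gradebook: unlimited pre/post-lecture submissions -> keep max attempt and max score.
    submissions = pd.DataFrame({
        "student_id": [1, 1, 1, 2, 2],
        "attempt":    [1, 2, 3, 1, 2],
        "score":      [60, 80, 95, 70, 85],
    })
    best = submissions.groupby("student_id").agg(max_attempt=("attempt", "max"),
                                                 max_score=("score", "max"))

    # Clickstream: one row per logged event -> per-student counts of each event type.
    clicks = pd.DataFrame({
        "student_id": [1, 1, 2, 2, 2],
        "event": ["syllabus", "video", "syllabus", "practice_exam", "video"],
    })
    click_counts = clicks.groupby(["student_id", "event"]).size().unstack(fill_value=0)

    # Inner join drops students missing from either source, mirroring the filtering above.
    features = best.join(click_counts, how="inner")
    print(features)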

3.4 Exploratory Work

Initially, the purpose of this project was to explore the possibility of grade prediction with
machine learning and, if successful, to further refine the depth of the predictions and breadth
of possibility. It was unknown how well a neural network would be able to perform numerical
predictions on this type of data set, which prompted experimentation.

Figure 3.1: A 2 layer long-short term memory neural network. The first reproducible, if not
particularly accurate, prediction.

Long short-term memory (LSTM) neural networks were first utilized due to their ability
to "remember" information from earlier time steps and their strong performance on
temporal data. Of note, it was discovered that including some form of dropout
in the layers was of significant importance to achieve regularization in the training process.
Dropout (also referred to as dilution in machine learning) reduces overfitting by preventing complex
co-adaptations on training data, randomly removing units during training to thin the network.

Figure 3.2: A 2 layer long-short term memory neural network with dropout occurring after each
layer.

A series of tests were performed with various numbers of LSTM layers, dropout layers,
epochs, and unit manipulation. These can be further explored in the appendix.
A paper on a relatively new evolution of LSTMs, Empirical Evaluation of Gated Recurrent
Neural Networks on Sequence Modeling by Chung et al. [9], prompted investigation into the
feasibility of GRUs over LSTMs, primarily because of their speed advantage over LSTMs at
approximately the same level of accuracy.

3.5 Hyperparameter Tuning

Perhaps the most important aspect of neural networks is their hyperparameters and the
tuning of them. Blindly tweaking these "knobs and dials" can sometimes lead to a desired
result - in this case, more accurate predictions indicated by tight clustering and a lower MAE
and MSE - but more often than not will lead to unusable outputs (see 8 layer LSTM in the
appendix). However, initially, blind tuning was performed. Choices were made arbitrarily in
order to see the outcome. This is akin to a "manual gradient descent" in which the surface that
was being navigated was an n-dimensional surface corresponding to grade prediction, where n
is the number of features. The full list of hyperparameters is as follows: number of epochs,
batch size, learning rate, recurrent dropout for each layer, dropout between each layer, and
the number of units for each layer. In total, assuming 3 layers, there are 12 hyperparameters.
The number of epochs determines how many cycles are performed. An epoch is training the
neural network with all the training data for one cycle. In an epoch, all of the data is used
exactly once. A forward pass and a backward pass together are counted as one pass: an epoch is
made up of one or more batches, where part of the dataset is used to train the neural network.
The batch size defines the number of samples that will be propagated through the network.
For example, with 1000 training samples and a batch size of 100, the algorithm will take the
first 100 samples from the training dataset and train the network. Then, it takes the next
set of 100 and trains the network again, and so forth. In this example, 10 trainings occur in a
single epoch. Dropout and recurrent dropout are very similar: both are a float value between
0 and 1 representing a percentage of data to be randomly regularized. Dropout refers to the
fraction of units to drop for the linear transformation of the inputs, whereas recurrent dropout
refers to the fraction of units to drop for the linear transformation of the recurrent state. The
number of units in each layer corresponds to the number of neurons in that layer; a neuron, or unit,
typically represents a single object with one or more weighted input connections, a transfer function
that combines the inputs in some way, and an output connection. Efforts were made to perform brute-force
gradient descent methods, but the sheer number of initial possible combinations made this
computationally impossible. Automatic stochastic gradient descent methods to optimize the
hyperparameters were next investigated, but ultimately were not explored further because of
time constraints, project focus, and the required work: a brand new algorithm would have been
necessary to create to accomplish this task, and it was deemed too expensive a time investment.
After some experimentation, a brute-force hyperparameter optimization approach was taken:
the algorithm changed a single hyperparameter by an allowed amount, ran an entire iteration,
and compared MAE and MSE values. If both were better than a previous iteration, those values
were kept as the new baseline. If one was better, but not the other, or both were not, the
iteration was considered not an improvement and the perturbed hyperparameter was perturbed
further. If again there was no improvement, the hyperparameter was reverted to its baseline
and the next hyperparameter underwent the process. This continued for an extensive period of
time until, with developer oversight, the model appeared to settle into a sufficient minimum and
produced acceptably low MAE and MSE values. A table of typical hyperparameter values can
be found in the appendix.
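The search just described is essentially a coordinate-wise perturbation loop. A simplified sketch of that logic follows; train_and_score is a stand-in for building, training, and evaluating the GRU (returning MAE and MSE), and the step sizes and dummy scorer are illustrative assumptions rather than the actual tuning script.

    def coordinate_search(hparams, steps, train_and_score, rounds=3):
        """Perturb one hyperparameter at a time; keep the change only if MAE and MSE both improve."""
        best_mae, best_mse = train_and_score(hparams)
        for _ in range(rounds):
            for name, step in steps.items():
                for attempt in (1, 2):                     # perturb, then perturb further
                    trial = dict(hparams)
                    trial[name] = hparams[name] + attempt * step
                    mae, mse = train_and_score(trial)
                    if mae < best_mae and mse < best_mse:  # both metrics must improve
                        hparams, best_mae, best_mse = trial, mae, mse
                        break                              # move on to the next hyperparameter
        return hparams, best_mae, best_mse

    # Example with a dummy scorer; a real run would train the network instead.
    dummy = lambda hp: (abs(hp["units"] - 48) + 1.0, (hp["units"] - 48) ** 2 + 10.0)
    print(coordinate_search({"units": 32, "batch_size": 16}, {"units": 8, "batch_size": 8}, dummy))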
Additionally, model callbacks were implemented to improve the efficiency of the model.
Tensorflow’s early stop callback was used to stop training of the model early (before the maximum
allowed epochs have elapsed) when a minimum specified improvement (minimum delta) in a
monitored quantity (monitor) is not met over a certain number of epochs (patience). Every
time an improvement greater than the minimum delta in the monitored quantity occurs, the patience
counter is reset. If the minimum delta is not exceeded and the patience elapses, model training is
stopped and the fit is finalized.
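A minimal sketch of how such a model and callback might be assembled with TensorFlow/Keras is shown below. The layer sizes, dropout rates, and callback settings are placeholders rather than the tuned values reported in the appendix, and the fit call is commented out because it requires the prepared training arrays.

    import tensorflow as tf

    n_timesteps, n_features = 4, 24   # illustrative input shape (time steps x features)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_timesteps, n_features)),
        tf.keras.layers.GRU(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.GRU(64, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dense(1),     # predicted final grade (percent)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse", metrics=["mae"])

    # Stop training early when the monitored quantity fails to improve by at least
    # min_delta for `patience` consecutive epochs.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  min_delta=1e-3, patience=20)

    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=500, batch_size=32, callbacks=[early_stop])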

3.6 Feature Correlation

To better understand the relationship between features and grade prediction, a heatmap
was generated to show the neural network’s understanding of the relative importance of the
feature set. Of particular note is the final column, final_Grade, from which the relative importance
of each feature to the final grade can be read. Unsurprisingly, GPA dominates the prediction weighting, followed
by homework scores. Interestingly, aggregated on-site video interaction (KALTURA) was next
highest in importance, suggesting that the more students interacted with hosted material, the
better their overall grade.

Figure 3.3: Correlation heatmap of features used in grade prediction. Of note is the final column,
final_Grade, suggesting feature importance to the overall prediction.
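A heatmap of this kind can be generated directly from the assembled per-student feature table with pandas and seaborn. The DataFrame below is a synthetic placeholder with made-up column names; the only assumption carried over from this section is that the final column is final_Grade.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Placeholder feature table; in practice this is the table built in Section 3.3.
    rng = np.random.default_rng(0)
    features = pd.DataFrame(rng.random((100, 4)),
                            columns=["OSU_GPA", "homework", "KALTURA", "final_Grade"])

    corr = features.corr()            # pairwise Pearson correlations
    sns.heatmap(corr, cmap="coolwarm", center=0)
    plt.title("Feature correlation heatmap")
    plt.tight_layout()
    plt.show()

    # The final_Grade column, sorted, gives a quick ranking of feature relevance.
    print(corr["final_Grade"].sort_values(ascending=False))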

Chapter 4

Results, Discussion & Comparison

4.1 Week 11 Predictions

One may pose the question: if the entirety of the gradebook is present, why does the
model not perfectly predict a student's grade? First, the entire gradebook is not present: in
Week 11 it is specifically missing reflective writing. Additionally, by the very nature of machine
learning, and the goal of this project, we are attempting to combine multiple different umbrellas
of input feature types: institutional, clickstream, and gradebook, thus the model doesn’t only
see gradebook data, but has to learn patterns amongst all three. This is why we do not see
perfect predictions. Additionally, the grading scale for PH20X is skewed: full credit is given for
over 85% correctness for pre- and post-lecture quizzes. Late assignments are capped at 50% late
penalty. The grading scale is also shifted from traditional:

85-100 A
80-84 A-
77-79 B+
68-76 B
65-67 B-
62-64 C+
50-61 C
45-50 D
0-45 F

which directly conflicts with a 4.0 GPA scale as the GPA scale assumes traditional grading.
These are just some of the hurdles that the model must overcome.
The following graphs were generated by demographic breakdown using end-of-term data
(week 11). The left graph corresponds to grade prediction, and the right graph indicates training
behavior. Notably, the lower the MAE and MSE, generally, the better the prediction, and this
can be seen also in the learning behavior, where the loss is minimized.

Figure 4.1: F2019 Unsegregated population. Left: Predicted vs. Actual Grades, with a Mean
Absolute Error (MAE) of 2.856 and a Mean Squared Error (MSE) of 13.858. Right: Loss of the
training and testing sets over epochs. The testing set settled into a minima fairly quickly, within
15 epochs or so. The training set took approximately 50 epochs to settle into a minima.

Predictions on the unsegregated population yielded values of 2.856 and 13.858 for MAE and MSE,
respectively. Visually, the clustering looks good and tracks nearly along the diagonal, which represents
a perfect 1:1 predicted on actual grade. Examining the loss plot (right), loss is minimized in the
testing dataset. The model quickly settled into a valley for both training and testing sets after
approximately 20 epochs and then marginally decreased the loss over an additional 60 epochs.

Figure 4.2: F2019 Male-Identifying population. Left: Predicted vs. Actual Grades, with a Mean
Absolute Error (MAE) of 5.758, and a Mean Squared Error (MSE) of 40.136. The predicted grades
trended low, within 5%, for nearly every student. Right: Loss of the training and testing sets
over epochs. Both the training and testing sets settled into their minima around 40 epochs in.
The training set started with a large loss, >0.12, but came down quickly.

The male-identifying population predictions performed worse compared to the unsegregated
population, overall being underpredicted. This could be due to several influences: "real life"
factors, rather than computational ones, and/or a lower population to test against.
Certainly, the lower testing population contributes to some extent, but external factors may be
unquantifiably present. Comparatively, the male-identifying population was predicted with roughly
twice the error of the female-identifying population (Figure 4.3) and the unsegregated population
(Figure 4.1), and with roughly equal error to the Non-1st Generation population (Figure 4.5) and
the White population (Figure 4.6). The data here show that something is happening, but
research into social influences on educational performance is needed before suggesting anything.

Figure 4.3: F2019 Female-Identifying population. Left: Predicted vs. Actual Grades, with a
Mean Absolute Error (MAE) of 2.754, and a Mean Squared Error (MSE) of 21.839. The predicted
grades trend slightly towards overprediction. The top end of the grade distributions were nicely
clustered within a few percent of the expected, but the overprediction trend grows, albeit slowly,
the further down the grade scale. Right: Loss of the training and testing sets over epochs. Both
the training and testing sets settled quickly into their minima: the testing set took approximately
20 epochs, whereas the training set took approximately 60. However, the training set continued
to improve over time, albeit marginally.

The female-identifying population predictions performed even better than the unsegregated
population in terms of MAE; however, due to several outliers, the MSE was worse. Grouping was
tight and loss was minimized, but the gradient descent took nearly 150 epochs to settle
into a valley, whereas the unsegregated run took only approximately 80. Social factors may
pressure women to perform more consistently in an educational environment, reflected in Figure
4.3 by the extremely tight clustering. Systematic sexism in education serves to disadvantage
female-identifying persons by forcing them to work harder for similar levels of praise, recognition
of achievement, and opportunity [2] [7] [1]. It therefore is not a stretch to imagine that this
pressure forces female-identifying persons to work harder for identical amounts of reward, preparing
them for the effort that is university-level coursework.

Figure 4.4: F2019 1st Generation population. Left: Predicted vs. Actual Grades, with a Mean
Absolute Error (MAE) of 6.260, and a Mean Squared Error (MSE) of 77.272. The high error and
overall lack of clustering in the predictions is believed to be a result of insufficient data points
- i.e., there are not enough students to perform training and/or testing on. Right: Loss of the
training and testing sets over epochs. The volatility of the loss suggests that the instability in
the data cannot be rectified by the model.

Predicting 1st generation students was less successful, and this is believed to be exclusively
due to the lack of sufficient data. A group of approximately 80 students is not sufficient for this
specific GRU with these specific hyperparameters to accurately predict grades, even with week
11 data. Therefore, temporal prediction was not explored due to such a small dataset.

Figure 4.5: F2019 Non-1st Generation population. Left: Predicted vs. Actual Grades, with
a Mean Absolute Error (MAE) of 5.561 and a Mean Squared Error (MSE) of 42.552. The
predictions are skewed towards underprediction, albeit not linearly. The lower the actual grade,
the more the model underpredicted. Right: Loss of the training and testing sets over epochs.
Both settled into their minima around 60 epochs into the training.

Interestingly, the non-1st generation dataset is skewed towards underprediction at the lower
end of the grading scale, but comes back in line towards the higher end. Clustering is good
despite the underprediction and loss is minimized. Some similar parallels can be drawn between
the male-identifying, female-identifying, and non-1st generation datasets, but the population
here isn’t high enough to draw even speculative conclusions.

Figure 4.6: F2019 White population. Left: Predicted vs. Actual Grades, with a Mean Absolute
Error (MAE) of 5.726 and a Mean Squared Error (MSE) of 43.425. The entirety of the population
experienced underprediction, roughly linear to the grade scale, of approximately 5%. Right: Loss
of the training and testing sets over epochs. Both settled into their minima around 60 epochs.

The white demographic predictions were, surprisingly, largely underpredicted despite notice-
able clustering and good loss minimization. Drawing on knowledge about bias in the education
system, systematic racism serves to advantage the white population [10]. For example, predom-
inantly minority school districts tend to perform worse on college entrance exams (ACT, SAT)
than predominantly white school districts [11]. This disparity in preparation level would suggest
that the white population would have a lower MAE and cluster better, but this is the opposite
of what is seen. In comparison to Figure 4.7, the white population has over double the MAE
and MSE, strongly indicating that the prediction efficacy and clustering is worse than in the
non-white population.

Figure 4.7: F2019 Non-White population. Left: Predicted vs. Actual Grades, with a Mean
Absolute Error (MAE) of 2.872, and a Mean Squared Error (MSE) of 16.970. Right: Loss of the
training and testing sets over epochs. Both settled into their minima around 50 epochs.

The non-white demographic predictions were fairly accurate and tracked along the 1:1 pre-
diction line. The decrease in error for prediction on the non-white population is thought to be,
at least partially, the result of educational pressure on minorities [10] [11].

4.2 Week 4 Predictions

Notably in week 4, the first exam has not yet occurred, and so the predictive power of the
model is called into question. Initially, the model had difficulty producing any favorable results,
as shown in Fig. 4.8.

Figure 4.8: F2019 Week 4 Unsegregated 1st Attempt

The primary difference between Week 4 and Week 11 datasets is that Week 4 does not
include exam solutions accessed or laboratory percentage but does include discrete pre- and
post-lecture attempts and score (Week 11 does not). Because of the large increase in the number
of subfeatures from pre- and post-lecture learning objectives (nearly double), the number of units
in the GRU layers was discovered to be vastly insufficient. Increasing the number of units in
each layer by a factor of ten produces Fig. 4.9.

Figure 4.9: F2019 Week 4 Unsegregated Refined Hyperparameters

A further substantial increase in the number of units in each layer led to the following figures:

Figure 4.10: F2019 Week 4 Unsegregated Further Refined Hyperparameters 1

Figure 4.11: F2019 Week 4 Unsegregated Further Refined Hyperparameters 2

Iterations of hyperparameter tuning led to the following, finalized Week 4 predictions. Week
4 hyperparameter values can be found in the appendix, Table 5.2.
Note the change in loss graph labeling: train and test have become Loss on Training Set
and Loss on Validation Set. The colors have not changed.

Figure 4.12: F2019 Week 4 Unsegregated population. Left: Predicted vs. Actual Grades, with
a Mean Absolute Error (MAE) of 5.515, and a Mean Squared Error (MSE) of 49.726. Right:
Loss of the training and testing sets over epochs. There was a large spike in the training set
around 20 epochs, suggesting that the training set began to walk up the gradient rather than
down. Both sets quickly settled into a minima around epoch 50, and made small improvements
over time.

Figure 4.13: F2019 Week 4 Male-Identifying population. Left: Predicted vs. Actual Grades,
with a Mean Absolute Error (MAE) of 5.955, and a Mean Squared Error (MSE) of 52.062.
Right: Loss of the training and testing sets over epochs. The large spike in training loss around
epoch 60 suggests that the training set began to settle into a local minimum while the
validation (testing) set continued to walk down the gradient. Both found their local minima after
approximately 100 epochs, however, the unstable behavior of the loss after this point suggests
there is further tuning to be done.

Figure 4.14: F2019 Week 4 Female-Identifying population. Left: Predicted vs. Actual Grades,
with a Mean Absolute Error (MAE) of 4.549, and a Mean Squared Error (MSE) of 31.669.
Interestingly, the model did not predict anyone over 85%. Right: Loss of the training and
testing sets over epochs. The large spike in the loss suggests the model began to walk up the
gradient until settling into a minima around 65 epochs.

Figure 4.15: F2019 Week 4 1st-Generation population. Left: Predicted vs. Actual Grades, with
a Mean Absolute Error (MAE) of 6.526, and a Mean Squared Error (MSE) of 72.433. Right:
Loss of the training and testing sets over epochs. The model quickly settled into a false valley
and the loss climbed quickly, but it began to resume gradient descent and the model settled into
a minima around 60 epochs.

Figure 4.16: F2019 Week 4 Non-1st Generation population. Left: Predicted vs. Actual Grades,
with a Mean Absolute Error (MAE) of 6.078, and a Mean Squared Error (MSE) of 62.415. Right:
Loss of the training and testing sets over epochs. A very large spike in loss at 5 epochs continued
to climb through epoch 10 until the model settled into a minima at epoch 30.

Figure 4.17: F2019 Week 4 White population. Left: Predicted vs. Actual Grades, with a
Mean Absolute Error (MAE) of 7.877, and a Mean Squared Error (MSE) of 92.650. The model
struggled to predict, evidenced by the circular clustering and significant outliers. Right: Loss
of the training and testing sets over epochs. The model really struggled to find the direction of
the gradient despite the number of epochs. It appears the training set started to descend, but
the testing set did not.

Figure 4.18: F2019 Week 4 Non-White population. Left: Predicted vs. Actual Grades, with a
Mean Absolute Error (MAE) of 6.672, and a Mean Squared Error (MSE) of 63.816. Despite the
smaller population, the model was able to find patterns well enough to reach the MAE stated
previously. Right: Loss of the training and testing sets over epochs. There was a large spike of
both training and testing loss around 60 epochs, but it settled by epoch 90 and then continued
to slowly decrease over the course of several hundred epochs, suggesting the model found an
absolute minimum.

All demographic predictions exhibit a large, localized spike in the loss on the training
dataset. Some exhibit a co-located spike in the loss on the testing dataset. These spikes
occur for a multitude of reasons, but a few likely ones are: loss dependence, where log-likelihood
losses need to be clipped lest they are evaluated near log(0), which causes an exploding gradient; and
outliers, where, if a significantly dominant portion of the dataset (>99.9%) has its optimum at a point
except for some discrete observation, the spike may be that observation forcing the algorithm out of the
local minimum when it is randomly selected into a batch, parameterized, and computed. Clearly, the
predictive power of the model exists for pre-exam datasets, but its efficacy is less than that
of week 11, full-term datasets. This is expected, since the exam and lab scores weight both the grades
themselves and the predictions so heavily, as shown previously.
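To illustrate the clipping remark, the generic snippet below shows how predicted probabilities are typically clipped away from 0 and 1 before a logarithm is taken, so that a single extreme observation cannot produce an infinite loss. It is a standalone illustration of log-likelihood losses and not part of the thesis model, which trains against MAE/MSE.

    import numpy as np

    def clipped_log_loss(y_true, y_prob, eps=1e-7):
        # Clip predictions away from 0 and 1 so log() never sees an exact zero.
        y_prob = np.clip(y_prob, eps, 1.0 - eps)
        return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    y_true = np.array([1.0, 0.0, 1.0])
    y_prob = np.array([0.9, 1.0, 0.6])       # second prediction is exactly 1.0
    print(clipped_log_loss(y_true, y_prob))  # finite; without clipping, log(1 - 1.0) = -inf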
Clearly, the predictive power of the model exists for pre-exam datasets, but its efficacy is less
than that of the 11-week, full-term datasets. This is expected, since exam and lab scores weight
both the grades themselves and the predictions so heavily, as shown previously.
Comparing the Week 4 results to Mauer & Walsh, we see a 0.415 reduction in MAE and
significantly tighter overall clustering [13]. Mauer & Walsh, utilizing a random forest regressor,
managed a 5.930 MAE in their examination of the unsegregated population. This suggests the
GRU model offers a notable improvement in predictive power over the random forest. Also, the
MSE coupled with the clustering suggests that the model has not overfit to the cohort.
Comparing to Stewart et al., Stewart's Figure 1 and Table 3 (representing Physics I) report
an accuracy of 0.78 for logistic regression and 0.75 for random forest for week 5 predictions
(when Stewart et al. introduced in-class variables) [15]. Reading 100% minus the MAE given
in Figure 4.12 as an effective accuracy, these correspond to decreases of 16.485 and 19.485
percentage points, respectively. Stewart's Table 7 (Physics II) reports accuracies of 0.84 and
0.81 for logistic regression and random forest, respectively; these values are between 11 and 14
points lower than the effective accuracy implied by Figure 4.12. The GRU model thus produces
an effective accuracy of approximately 95%, while Stewart et al. report an average accuracy
of 83%, plus or minus a few percentage points depending on the model examined. Overall,
Stewart et al.'s models do not present the same level of accuracy as found in this work,
suggesting there is significant merit to the neural network approach.
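
For transparency, the arithmetic behind the Physics I comparison can be reconstructed as
follows, assuming the GRU's effective accuracy is read as 100% minus the Week 4 unsegregated
MAE from Figure 4.12; this is an illustrative reading, not code from the thesis.

# Reconstructing the Physics I comparison, assuming the GRU's effective
# accuracy is (100% - MAE) for the Week 4 unsegregated population.
gru_mae = 5.515                     # Week 4 unsegregated MAE, in percent
gru_accuracy = 100.0 - gru_mae      # 94.485%, quoted as ~95% in the text

stewart_week5 = {"logistic regression": 78.0, "random forest": 75.0}
for model_name, acc in stewart_week5.items():
    print(f"{model_name}: {gru_accuracy - acc:.3f} points below the GRU")
# logistic regression: 16.485 points below the GRU
# random forest: 19.485 points below the GRU
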

Chapter 5

Conclusion & Further Work

Deep learning neural networks have shown promise in grade prediction, on par with or
exceeding random forest decision trees. Notable, however, is their ability to perform accurately
unsupervised and to extract per-student weight matrices, which paint a picture of exactly what
contributed most to a student's predicted grade. Week 11 predictions performed on the Fall
2019 PH201 physics cohort (filtering out students who took an incomplete in the course or were
missing one or more exams) conformed to expectations. Weight matrix analysis led to
qualitative discoveries, such as the bias for summative interaction with the course syllabus
flipping from positive before the first exam to negative after it. Week 4 predictions exceeded
expectations in exploratory work, showing excellent predictive efficacy across the entire cohort
and allowing further qualitative analysis of the demographic breakdowns.
The model began as a question in code form: can this work, and what will it produce? Exploring
different types of neural network architectures, primarily Long Short-Term Memory (LSTM) and
Gated Recurrent Unit (GRU), we discovered the viability of temporal prediction, exhibited in
Figures 5.1-5.10. These predictions were not useful in any educational sense, but they showed
that the model could find patterns in educational data and predict final grades from it.
Further exploratory work was done to develop the neural network algorithm to accept more
specific breakdowns of institutional, clickstream, and gradebook data, some of which had subdata
attached, again to investigate whether the model could interpret it. The GRU architecture
was constructed to be end-user friendly, allowing the user to specify a data location, name the
week of the term, and tell the model whether to optimize hyperparameters or use a preset.
This flexibility was the ultimate goal: to develop a tool that could be utilized by instructors in
large physics instructional environments so that they may directly support students.
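
As a rough illustration of that intended workflow, a command-line wrapper might look like the
following; the argument names and the train_and_predict helper are hypothetical stand-ins,
not the actual interface of the tool.

import argparse

# Hypothetical command-line front end for the GRU tool described above.
parser = argparse.ArgumentParser(description="Predict final grades from weekly course data.")
parser.add_argument("--data", required=True,
                    help="Path to the transformed institutional/clickstream/gradebook dataset.")
parser.add_argument("--week", type=int, default=4,
                    help="Week of the term that the data runs through.")
parser.add_argument("--tune", action="store_true",
                    help="Optimize hyperparameters instead of using the stored preset.")
args = parser.parse_args()

# train_and_predict() stands in for the GRU training and prediction pipeline.
# predictions = train_and_predict(args.data, week=args.week, tune=args.tune)
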
Week 11 predictions exhibited low MAEs, with the exception of the two demographics that
did not have sufficient population for the model to learn from. Week 4 predictions exhibited
higher error than Week 11, but this is expected given the absence of the two exams, which
together contribute 30% of the grade. The MAEs were as follows:

Population            MAE Week 11    MAE Week 4
Unsegregated          2.856          5.515
Male-Identifying      5.758          5.955
Female-Identifying    2.754          4.549
1st Generation        6.260          6.526
Non-1st Generation    5.561          6.078
White                 5.726          7.877
Non-White             2.872          6.672

An MAE of X tells us that, on average, the model's predictions differed from students' actual
grades by X percentage points. For example, the Week 4 unsegregated population had an MAE
of 5.515%, meaning that a student's actual grade was, on average, within 5.515 percentage
points of what the model predicted; this is inside a single letter grade on a traditional grading
scale. With this information, we know the model has reliable predictive power. The ultimate
goal is to use this information to assist students where they need the most academic support
in the physics instructional environment.
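
As a toy illustration (with made-up numbers, not thesis data), the two error metrics are
computed as follows:

import numpy as np

# Grades are on a 0-100 scale, so an MAE near 5 means predictions land
# within about 5 percentage points of the actual grade on average.
actual    = np.array([92.0, 78.0, 85.0, 61.0, 70.0])
predicted = np.array([88.0, 83.0, 81.0, 67.0, 72.0])

mae = np.mean(np.abs(actual - predicted))    # mean average (absolute) error
mse = np.mean((actual - predicted) ** 2)     # mean squared error
print(f"MAE = {mae:.3f}%, MSE = {mse:.3f}")  # MAE = 4.200%, MSE = 19.400
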
Introductory physics, whether algebra- or calculus-based, is difficult; the problem-solving
mindset it requires is often one that students have not engaged in before. The intervention door
opened by these predictions could allow instructors to provide targeted, learning-outcome-driven
support for individual students. The information gained here can be utilized in many ways:
directed intervention, group support based on common subject struggles, curriculum changes
based on the influence of gradebook items, and so on.

In the near future, work may be done to expand the feature set from a single term to en-
capsulate all three terms of the PH20X series. In general, the more input data a neural network
receives, the better the model, and this course of action would be expected to improve the
model. Additionally, other types of features could be incorporated, such as sentiment analysis
from natural language processing of reflective writings in PH20X at Oregon State University,
a project actively being worked on by Jared Smith of the BoxSand group. Further, principal
component analysis (PCA) may be performed on the transformed dataset before it reaches the
neural network model, reducing the dimensionality of the input and decreasing compute time.
PCA may also yield information on the importance of specific learning objectives, assignments,
and other items to a student's overall grade and thus may be influential for course design.
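
A minimal sketch of that PCA step, assuming the transformed features have already been
flattened into a students-by-features matrix; the synthetic array and the 95% variance target
below are only illustrative stand-ins, not values from this project.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the transformed dataset: 120 students x 40 features.
X = np.random.rand(120, 40)

# Keep enough components to explain 95% of the variance; the reduced matrix
# is what would then be fed to the neural network.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (120, k) with k <= 40
print(pca.explained_variance_ratio_[:5])  # weight of the leading components
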

Appendix

Figure 5.1: An 8 layer LSTM. Performance was notably poor.

Figure 5.2: A 2 layer LSTM with dropout after each layer, no tuning of hyperparameters.

Figure 5.3: A 2 layer LSTM with both dropout in the layers and after each.

Figure 5.4: A 2 layer LSTM with both dropout in the layers and after each, run for 500 epochs.

Figure 5.5: A 2 layer LSTM with double dropout, run on the full student dataset, where the
units in each layer corresponded to the number of features.

Figure 5.6: A 2 layer LSTM with double dropout, where the units in each layer corresponded
to double the number of features.

Figure 5.7: A 2 layer LSTM with double dropout, where the units in each layer corresponded
to triple the number of features.

Figure 5.8: A 2 layer LSTM with double dropout, where the units in each layer corresponded
to an arbitrary choice of 48.

Figure 5.9: A 4 layer LSTM with randomized hyperparameters.

Figure 5.10: Another 4 layer LSTM with randomized hyperparameters.

The following is a collection of predictions trained on Week 4 data; none is particularly notable,
but all were important in the process of tuning hyperparameters. They differ in their
hyperparameter values.

Figure 5.11: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 1. Layer 3 Units: 1.

Figure 5.12: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 1. Layer 3 Units: 5.

Figure 5.13: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 1. Layer 3 Units: 25.

Figure 5.14: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 1. Layer 3 Units: 625.

Figure 5.15: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 5. Layer 3 Units: 1.

Figure 5.16: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 5. Layer 3 Units: 5.

Figure 5.17: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 5. Layer 3 Units: 25.

Figure 5.18: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
1. Layer 2 Units: 5. Layer 3 Units: 125.

Figure 5.19: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
125. Layer 2 Units: 25. Layer 3 Units: 5.

Figure 5.20: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
125. Layer 2 Units: 25. Layer 3 Units: 5.

Figure 5.21: Week 4 Hyperparameter Tuning. Epochs: 500. Batch Size: 1. Learning Rate:
0.001. Layer 1 Dropout: 0.15. Layer 2 Dropout: 0.15. Layer 3 Dropout: 0.15. Layer 1 Recurrent
Dropout: 0.2. Layer 2 Recurrent Dropout: 0.2. Layer 3 Recurrent Dropout: 0.2. Layer 1 Units:
125. Layer 2 Units: 125. Layer 3 Units: 5.

Figure 5.22: Week 4 Hyperparameter Tuning, No Exam Solutions Accessed, No Lab, Doubled
Epochs

Figure 5.23: Week 4 Hyperparameter Tuning, No Exam Solutions Accessed, No Lab

Table 5.1: Week 11 Hyperparameters

# Epochs 500
Batch size 1
Learning rate 1e-4
Recurrent dropout 1 0.15
Recurrent dropout 2 0.15
Recurrent dropout 3 0.15
Dropout 1 0.2
Dropout 2 0.2
Dropout 3 0.2
GRU Units 1 150
GRU Units 2 100
GRU Units 3 50

Table 5.2: Week 4 Hyperparameters

# Epochs 500
Batch size 1
Learning rate 1e-3
Recurrent dropout 1 0.15
Recurrent dropout 2 0.15
Recurrent dropout 3 0.15
Dropout 1 0.2
Dropout 2 0.2
Dropout 3 0.2
GRU Units 1 625
GRU Units 2 125
GRU Units 3 25
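
For reference, the following is a minimal sketch of a three-layer GRU regressor wired up with
the Week 11 preset from Table 5.1 (the Week 4 preset in Table 5.2 differs only in learning rate
and unit counts). The input shape and the single-unit output head are assumptions for
illustration, since the exact model-construction code is not reproduced here.

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical input shape: 11 weekly timesteps x 40 features per week.
n_timesteps, n_features = 11, 40

model = tf.keras.Sequential([
    layers.Input(shape=(n_timesteps, n_features)),
    layers.GRU(150, return_sequences=True, dropout=0.2, recurrent_dropout=0.15),
    layers.GRU(100, return_sequences=True, dropout=0.2, recurrent_dropout=0.15),
    layers.GRU(50, dropout=0.2, recurrent_dropout=0.15),
    layers.Dense(1),  # predicted final grade (percent)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
model.summary()
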

Bibliography

[1] Frank Merrett and Kevin Wheldall. “Teachers’ Use of Praise and Reprimands to Boys and
Girls”. In: Educational Review 44.1 (Jan. 1992), pp. 73–79. issn: 0013-1911. doi:
10.1080/0013191920440106. url: https://doi.org/10.1080/0013191920440106 (visited on 05/25/2022).
[2] Madeleine Arnot et al. Recent Research on Gender and Educational Performance (Ofsted
Reviews of Research). June 1998. isbn: 978-0-11-350102-1.
[3] Alexandra C. Achen and Paul N. Courant. “What Are Grades Made Of?” In: Journal
of Economic Perspectives 23.3 (Sept. 2009), pp. 77–92. issn: 0895-3309. doi:
10.1257/jep.23.3.77. url: https://www.aeaweb.org/articles?id=10.1257/jep.23.3.77
(visited on 01/20/2022).
[4] O. J. Oyelade, O. O. Oladipupo, and I. C. Obagbuwa. “Application of k Means Clus-
tering algorithm for prediction of Students Academic Performance”. In: arXiv:1002.2425
[cs] (Feb. 2010). arXiv: 1002.2425. url: http://arxiv.org/abs/1002.2425 (visited on
01/20/2022).
[5] Joaquin Vanschoren and Hendrik Blockeel. “Experiment Databases”. In: Inductive Databases
and Constraint-Based Data Mining. Nov. 2010, pp. 335–361. isbn: 978-1-4419-7737-3.
doi: 10.1007/978-1-4419-7738-0_14.
[6] Kuiyuan Li, Josaphat Uvah, and Raid Amin. Predicting Students’ Performance in Ele-
ments of Statistics. Online Submission. ISSN: 1548-6613. 2012. url:
https://eric.ed.gov/?id=ED537981 (visited on 01/20/2022).
[7] Pilar Montañés et al. “Intergenerational Transmission of Benevolent Sexism from Mothers
to Daughters and its Relation to Daughters’ Academic Performance and Goals”. In: Sex
Roles 66.7 (Apr. 2012), pp. 468–478. issn: 1573-2762. doi: 10.1007/s11199-011-0116-0.
url: https://doi.org/10.1007/s11199-011-0116-0 (visited on 05/24/2022).
[8] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder-Decoder
for Statistical Machine Translation”. In: arXiv:1406.1078 [cs, stat] (Sept. 2014). arXiv:
1406.1078. url: http://arxiv.org/abs/1406.1078 (visited on 01/20/2022).
[9] Junyoung Chung et al. “Empirical Evaluation of Gated Recurrent Neural Networks on
Sequence Modeling”. In: arXiv:1412.3555 [cs] (Dec. 2014). arXiv: 1412.3555. url:
http://arxiv.org/abs/1412.3555 (visited on 01/20/2022).
[10] Drew S. Jacoby-Senghor, Stacey Sinclair, and J. Nicole Shelton. “A lesson in bias: The
relationship between implicit racial bias and performance in pedagogical contexts”. In:
Journal of Experimental Social Psychology 63 (Mar. 2016), pp. 50–55. issn: 0022-1031.
doi: 10.1016/j.jesp.2015.10.010. url:
https://www.sciencedirect.com/science/article/pii/S002210311530010X (visited on 05/25/2022).
[11] Richard V. Reeves and Dimitrios Halikias. Race gaps in SAT scores highlight inequality and
hinder upward mobility. Feb. 2017. url:
https://www.brookings.edu/research/race-gaps-in-sat-scores-highlight-inequality-and-hinder-upward-mobility/
(visited on 05/25/2022).
[12] Simeon Kostadinov. Understanding GRU Networks. Nov. 2019. url:
https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be (visited on 01/20/2022).
[13] Michael Mauer and Kenneth C. Walsh. “Using Machine Learning to Predict Student
Grades with Clickstream Data from a Flipped Classroom Environment”. Undergraduate
Thesis. Department of Physics, Oregon State University, 2020.
[14] Juan L. Rastrollo-Guerrero, Juan A. Gómez-Pulido, and Arturo Durán-Domínguez. “An-
alyzing and Predicting Students’ Performance by Means of Machine Learning: A Review”.
In: Applied Sciences 10.3 (Jan. 2020), p. 1042. doi: 10.3390/app10031042. url:
https://www.mdpi.com/2076-3417/10/3/1042 (visited on 01/20/2022).
[15] Jie Yang et al. “Using Machine Learning to Identify the Most At-Risk Students in Physics
Classes”. In: Physical Review Physics Education Research 16.2 (Oct. 2020), p. 020130.
arXiv: 2007.13575. issn: 2469-9896. doi: 10.1103/PhysRevPhysEducRes.16.020130. url:
http://arxiv.org/abs/2007.13575 (visited on 01/20/2022).
[16] Minchen Li. “A Tutorial On Backward Propagation Through Time (BPTT) In The Gated
Recurrent Unit (GRU) RNN”. p. 15.
