
Companion Proceedings 13th International Conference on Learning Analytics & Knowledge (LAK23)

Talamas-Carvajal, J. A. (2023, March 13-17). The Middle-Man Between Models and Mentors: SHAP Values to Explain Dropout Prediction
Models in Higher Education [Poster presentation]. Learning Analytics and Knowledge Conference, Arlington, Texas, USA.
https://www.solaresearch.org/wp-content/uploads/2023/03/LAK23_CompanionProceedings.pdf

The Middle-Man Between Models and Mentors: Using SHAP Values to Explain Dropout Prediction Models in Higher Education
Juan Andrés Talamás Carvajal
Tecnológico de Monterrey
juan.talamas@tec.mx

ABSTRACT: One of the challenges of prediction and classification models in education is that the
best-performing models usually come as a "black box": it is almost impossible for non-data
scientists (and sometimes even experienced researchers) to understand the rationale behind a
model's prediction. In this poster we show how SHAP (SHapley Additive exPlanations) values
can serve as a baseline for model explainability, and how the same tool might be used for
further variable analysis and possibly even bias detection. We do this by obtaining SHAP values
and figures for two dropout prediction models trained with student data from two different
educational models implemented at the same university.

CCS Concepts: • Applied computing → Education; Learning management systems.


Additional Key Words and Phrases: Dropout, XAI, AI fairness

1 INTRODUCTION

This poster aims to demonstrate the usefulness of SHAP (SHapley Additive exPlanations) (Lundberg et
al., 2018) as a tool that can help stakeholders better understand a machine learning model, and as a
way to visualize possible bias in education. Our case study is built around the identification of
students at risk of dropping out using a machine learning approach; the variables used are general
demographic information, extracurricular activities, and grades and information from previous school
levels. We show these results for two distinct educational models from Tecnológico de Monterrey.
The educational models differ markedly from one another, something we take advantage of to
demonstrate SHAP as a general explainability tool.

2 MATERIALS AND METHODS

2.1 SHAP and Shapley values

SHAP is an explainability tool based on a game-theoretic approach to fair distribution called Shapley
values, in which several players (model features) interact to obtain a payout (prediction). A feature's
Shapley value is its marginal contribution to the difference between the model's expected (average)
output and its actual output for a given input.
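
For reference, the standard game-theoretic definition is as follows: the Shapley value of feature i over the full feature set N, where v(S) denotes the model's expected output when only the features in subset S are known, is

```latex
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \, \big( v(S \cup \{i\}) - v(S) \big)
```

The combinatorial weights average feature i's marginal contribution over all possible orderings in which features can be added to the input.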

SHAP has the following properties that allow it to serve as a reliable explanation of a particular
model (Lundberg et al., 2018):

- Local accuracy: the explainer approaches the true model for a specific input as other values are removed.
- Missingness: a feature missing from the input is assigned no impact (a SHAP value of 0).
- Consistency: if a model changes such that an input's contribution increases or stays the same, that input's explainer value should not decrease.

It is important to note that SHAP can be applied to both linear and non-linear models. While the
"Additive" part of the name might suggest linearity, it refers only to how individual feature
attributions are summed to arrive at a prediction, not to a requirement that the model itself be linear.
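
In practice, obtaining SHAP values takes only a few lines with the shap library. The following is a minimal sketch, assuming a tree-based classifier and hypothetical file and column names; the poster does not specify the model class used:

```python
# Minimal sketch: SHAP values for a dropout model.
# The model class, "students.csv", and the "dropout" column are
# illustrative assumptions, not the poster's actual pipeline.
import shap
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

X = pd.read_csv("students.csv")        # one row per student (hypothetical file)
y = X.pop("dropout")                   # 1 = dropout, 0 = retained

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer computes exact Shapley values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_students, n_features)
```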

2.2 Dataset

The dataset used in this paper was obtained from Tecnológico de Monterrey (Alvarado-Uribe et al.,
2022). It contains anonymized information on undergraduate students who enrolled and attended at
least one semester between 2014 and 2020. The data is available upon request in the Institute for
the Future of Education's Educational Innovation collection of the Tecnológico de Monterrey Data
Hub at https://doi.org/10.57687/FK2/PWJRSJ. The dataset covers both Tec20 (the "classic"
educational model) and Tec21 (a competence-based educational model). The two models differ
greatly in structure and processes, which motivated their comparison.

3 RESULTS

Ranking variable importance can be a helpful first step, but it is with swarm plots that SHAP values
start to show their worth. Figure 1 shows both educational models side by side. These plots combine
a vertical ranking of features (y-axis) with a mapping of how each variable's specific values impact
the model output. A point's color indicates a high (red), low (blue), or intermediate (purple) value of
the corresponding variable, while its horizontal position (x-axis) shows how that value impacted the
model output. Each point represents a single student's value for that feature.


Figure 1: Side-by-side comparison of swarm plots for the Tec20 (left) and Tec21 (right) educational models

A quick example: a low value for "english.evaluation" in the Tec20 model (left) is shown in blue and
has a positive SHAP value, pushing the model output towards our target variable (Dropout=True).
That is, a low "english.evaluation" score increases the predicted risk of dropout. Swarm plots allow
for quick discovery of variable effects in a way that is both intuitive and informative. From Figure 1
we can conclude that, on average, participation in leadership and culture activities (high values in
the "culture" and "leadership" variables) leads to retention predictions in the Tec20 educational
model, while low admission test scores and older age push towards dropout predictions in the Tec21
educational model.
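
A plot of this kind can be produced in a single call with the shap library, continuing from the sketch in Section 2.1 (again, the variable names are assumptions):

```python
# Render a swarm (beeswarm) plot like Figure 1: one row per feature,
# ranked by importance; each dot is one student, colored by feature
# value, positioned by its SHAP value.
shap.summary_plot(shap_values, X)
```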

4 DISCUSSION AND CONCLUSION

Using SHAP values to identify the most important variables overall, along with the general effect of
those variables' values, makes for an extremely powerful tool for everyday educational practitioners.
Speaking to a student tutor or mentor about model precision and recall will not increase their trust
in machine learning, but showing them a swarm plot like the ones above lets them use their own
expertise to understand the model's decisions and rationale. However, we believe that SHAP values
can go even further.


Classic feature importance can easily be obtained from other tools, but the information that SHAP
values expose in swarm and waterfall plots allows any reasonably competent user to conduct their
own analysis without data science training, making the tool transparent in its decisions and giving
stakeholders the information needed for data-driven decisions. One possible use would be to identify
variables that could introduce bias (gender is a classic example) and verify their overall effect on the
model. If we find bias towards or against a specific group, it could indicate a problem with data
collection, or even a more systemic one.
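
As a rough sketch of what such a check could look like, one can compare the distribution of SHAP values across groups; the "gender" column name and its encoding below are assumptions, continuing from the earlier sketch:

```python
# Illustrative bias check: compare mean SHAP values of a sensitive
# variable across groups. "gender" is hypothetical here; the text
# names it only as a classic example of a bias-prone variable.
import numpy as np

gender_idx = X.columns.get_loc("gender")   # hypothetical column
gender_shap = shap_values[:, gender_idx]

for value in np.unique(X["gender"]):
    mask = (X["gender"] == value).to_numpy()
    # A group whose mean SHAP value consistently pushes towards
    # Dropout=True warrants scrutiny of the model and the data.
    print(value, gender_shap[mask].mean())
```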

Future research will focus on expanding the explainability and usefulness of the models, starting with
the development of counterfactuals (how much a variable's score needs to change to flip the
prediction) to provide a viable path towards "breaking the prophecy" of our machine learning models'
predictions; in other words, finding what a student at risk of dropout needs to change, and is able to
change, to reduce that risk.
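
To make the idea concrete, the simplest form of such a counterfactual is a one-variable search: nudge a single feature until the predicted label flips. The sketch below is purely illustrative and is not the method the poster proposes, which is left to future work:

```python
# Illustrative one-variable counterfactual search (hypothetical helper).
def counterfactual_step(model, student_row, feature, step, max_steps=50):
    """Return the feature value at which the prediction flips, else None."""
    base = model.predict(student_row.to_frame().T)[0]
    candidate = student_row.copy()
    for _ in range(max_steps):
        candidate[feature] += step
        if model.predict(candidate.to_frame().T)[0] != base:
            return candidate[feature]
    return None  # no flip within the search range

# e.g., how much would the first student's "english.evaluation" need to rise?
flip_at = counterfactual_step(model, X.iloc[0], "english.evaluation", step=1.0)
```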

REFERENCES

Alvarado-Uribe, J., Mejía-Almada, P., Masetto Herrera, A. L., Molontay, R., Hilliger, I., Hegde, V.,
        Montemayor Gallegos, J. E., Ramírez Díaz, R. A., & Ceballos, H. G. (2022). Student dataset
        from Tecnologico de Monterrey in Mexico to predict dropout in higher education. Data,
        7(9), 119. https://doi.org/10.3390/data7090119
Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T., Liston, D. E., Low, D. K.
        W., Newman, S. F., Kim, J., & Lee, S. I. (2018). Explainable machine-learning predictions for
        the prevention of hypoxaemia during surgery. Nature Biomedical Engineering, 2(10),
        749–760. https://doi.org/10.1038/s41551-018-0304-0

Creative Commons License, Attribution - NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
