The use of Data Mining for Predicting Injuries in Professional Football Players

Garth Theron

Thesis submitted for the degree of


Master in Programming and System Architecture
60 credits

Department of Informatics
Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO

Spring 2020
© 2020 Garth Theron

The use of Data Mining for Predicting Injuries in Professional Football Players

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Abstract

Injuries in professional football represent a financial burden to clubs and have a negative impact on
team performance. This has inspired a wide range of studies which attempt to gain insight into the risk
factors associated with injury. Until recently, the majority of research has been limited to univariate
studies. However, the emergence of machine learning has given rise to numerous multivariate studies.
The work covered in this thesis forms part of a partnership between the University of Oslo (UIO) and
the Norwegian School of Sports Science (NIH), which aims to research relationships between athlete
workloads, injury and illness. With this thesis serving as the first work undertaken in the partnership, two
goals are identified. The first is to build a data warehouse to store all available training, competition, injury
and illness data generated by a professional Norwegian football team. The data warehouse serves as a
unified representation of the club's data, providing data both for the second goal of this thesis and for
future research. The second goal is to conduct a data mining study using player workload and injury data
to predict future injury.
In the first phase of the thesis, a data warehouse is constructed using a traditional four-phase modelling
approach. The choice of a data warehouse is primarily motivated by three factors, namely, user needs,
the granularity of the data available, and the update frequency of the data store. Combining data from
three source systems and several internal sources, a star schema consisting of a single fact table and six
dimensions is created. The fact table includes six workload measures, namely, total distance, acceleration
load, player load, V4 distance, V5 distance, and HSR distance. These are aggregated using the SUM
operation, and provide users with the ability to extract data at granularities ranging from one-second
intervals up to the level of an entire season. The six dimensions included in this thesis are date, session,
injury, illness, player and training log. As the final step in the design phase, a materialised view is created
and includes workload and injury data aggregated at the session-level. The data included in the view is
explicitly selected for the analysis needs of the data mining completed in this thesis and does not take into
consideration the analysis needs of future work.
In the second phase of this thesis, data mining is conducted using the CRISP-DM framework. With
the aim of predicting future injury from workload data, five phases from this framework are carried
out, namely business understanding, data understanding, data preparation, modelling, and evaluation.
Four objectives are identified, all of which aim to predict future injury using workload and injury data
gathered from a single season of training and competition. Two definitions of injury are provided to
investigate whether models perform better for specific injury types. Four models of varying complexity
and interpretability are used in the modelling phase and include decision trees (DT), random forests (RF),
logistic regression (LG), and support vector machines (SVM). Model evaluation makes use of four class
independent evaluation metrics, precision, recall, F1 score and area under the curve (AUC), to assess a
model's predictive performance. Models of the relationship between workloads, a previous injury feature,
and injury show limited ability to predict future injury among players from a professional Norwegian
football team. Mean AUC scores are below 0.52 for all modelling approaches, indicating that injury
predictions are no better than those expected by random chance. Precision scores are higher than recall
scores for all modelling approaches, with a highest score of 0.65 ± 0.17 being achieved. The inability to
achieve recall scores higher than 0.02 means that all modelling approaches generate a large number of
false-negative predictions, indicating that models are unable to identify the injury class.
Contents

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Scope of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Structure of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 5
2.1 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Motivation for a Data Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Overview of Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 The CRISP-DM Process Model . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Machine Learning in Football . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Workloads in Predictive Modelling . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 A Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Source Data 13
3.1 GPS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Injury Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Training Log Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 First Impressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Data Warehouse Modelling 19


4.1 Requirements Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.1 Analysis Driven Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.2 Source-Driven Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Conceptual Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


4.2.1 The Initial Analysis-Driven Conceptual Schema . . . . . . . . . . . . . . . . . . 28


4.2.2 The Initial Source-Driven Conceptual Schema . . . . . . . . . . . . . . . . . . 30
4.2.3 Conceptual Schema Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.4 The Final Conceptual Schema and Mappings . . . . . . . . . . . . . . . . . . . 31
4.3 Logical Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.1 The Logical Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.2 Definition of the ETL Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Physical Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.1 Materialised Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.2 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 The ETL Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 Overview of the ETL Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.2 Cleaning of GPS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.3 Cleaning of Injury and Illness Data . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.4 Cleaning of Log Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.5 Load Date Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.6 Load Session Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5.7 Load Player Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.8 Load Injury and Illness Dimensions . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.9 Load Log Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.10 Load GPS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.11 Insert Special Members . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5 Data Mining 55
5.1 Business Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.2 Injury Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.3 Data Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.1 Summary Statistics for the Workload Features . . . . . . . . . . . . . . . . . . . 57
5.2.2 Summary Statistics for the Injury Data . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.3 Comparison of Workload Features for Injured and Non-Injured Players . . . . . 59
5.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.1 The Exponentially Weighted Moving Average (EWMA) . . . . . . . . . . . . . 62
5.3.2 The Mean Standard Deviation Ratio (MSWR) . . . . . . . . . . . . . . . . . . . 62
5.3.3 Injury Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.4 Creation of the Final Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


5.3.5 Correlation Matrix for the Features in the Final Data Set . . . . . . . . . . . . . 65
5.4 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.2 Adaptive Synthetic Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.3 Stratified K-Fold Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.4 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.5 Features From Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.6 Modelling Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5.3 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6 Conclusion 81
6.1 Data Warehouse Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Data Mining Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Appendices 91

A Source Code 93

B MultiDim Model for Data Warehouses 95

C BPMN Notation 98

List of Figures

1.1 Two phase approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 An illustration of a three-dimensional data cube . . . . . . . . . . . . . . . . . . . . . . 7


2.2 The CRISP-DM data mining life cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4.1 Phases in the data warehouse design process . . . . . . . . . . . . . . . . . . . . . . . . 19


4.2 Steps in the requirements specification phase using an analysis/source-driven approach . 20
4.3 Steps in the conceptual modelling phase using an analysis/source-driven approach . . . . 26
4.4 Illustration of the many-to-many relationship between a fact and a dimension . . . . . . 26
4.5 A decomposition of the injury/illness dimension . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Unbalanced hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 The initial conceptual model using an analysis-driven approach . . . . . . . . . . . . . . 29
4.8 The initial conceptual model using a source-driven approach . . . . . . . . . . . . . . 30
4.9 The final conceptual model generated from the analysis-driven and the source-driven models 32
4.10 Steps in the logical modelling process . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.11 A logical data warehouse schema for football data . . . . . . . . . . . . . . . . . . . . . 35
4.12 The use of placeholders to transform unbalanced hierarchies into balanced hierarchies . . 36
4.13 SQL code for generating a materialised view . . . . . . . . . . . . . . . . . . . . . . . . 39
4.14 A query for selecting data from a specific session . . . . . . . . . . . . . . . . . . . . . 40
4.15 Overview of the ETL process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.16 Tasks involved in the cleaning of GPS training data . . . . . . . . . . . . . . . . . . . . 45
4.17 Loading of the Date dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.18 Loading of the Session dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.19 Loading of the Player dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.20 Loading of the Injury and Illness dimensions . . . . . . . . . . . . . . . . . . . . . . . . 50
4.21 Loading of the fact table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.1 Visual overview of injury occurrence for the 2019 season . . . . . . . . . . . . . . . . . 60


5.2 Boxplots comparing the workload features of injured and non-injured players with respect
to NC injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 Boxplots comparing the distance features of injured and non-injured players with respect
to NC injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4 Boxplots comparing the distance features of injured and non-injured players with respect
to NCTL injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.5 Representation of the final data set constructed in the data preparation phase . . . . . . . 64
5.6 Correlation matrix of the injury and workload features in the final data set . . . . . . . . 65
5.7 An illustration of a test/train split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.8 An illustration of 5-Fold Cross Validation with ADASYN . . . . . . . . . . . . . . . . . 70
5.9 An illustration of Feature Elimination combined with K-Fold Cross Validation and ADASYN 72
5.10 Boxplots comparing the area under the curve for models using ADASYN and Stratified
K-Fold Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.11 Boxplots comparing the area under the curve for models using the previously identified
injury features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.12 Comparison of AUC scores for models using RFECV in combination with ADASYN and
Stratified K-Fold CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.13 Comparison of feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

B.1 Dimension level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


B.2 Fact table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B.3 Cardinalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
B.4 Dimension types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
B.5 Balanced hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
B.6 Ragged hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

C.1 General notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


C.2 More general notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
C.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
C.4 Gateways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

List of Tables

3.1 Features extracted from Catapult OptimEye X4 . . . . . . . . . . . . . . . . . . . . . . 14


3.2 Features recorded for player injuries and player illnesses . . . . . . . . . . . . . . . . . 15
3.3 The features recorded by a player after the completion of each training session . . . . . . 15

4.1 Fact summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


4.2 Dimension summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Summary of the multidimensional elements using a source-driven approach . . . . . . . 25
4.4 An example of double-counting in a many-to-many dimension. . . . . . . . . . . . . . . 27

5.1 Summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


5.2 Summary statistics after smoothing by bin means . . . . . . . . . . . . . . . . . . . . . 58
5.3 Injury statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4 Comparison of ML data set and data from data warehouse . . . . . . . . . . . . . . . . 64
5.5 A confusion matrix for binary classification . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6 Summary results of all models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.7 Confusion matrix for SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.8 Confusion matrix for DT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.9 A comparison of confusion matrices for SVM and DT models using NC data . . . . . . 75

Acknowledgements

I would like to thank my supervisors, Vera Hermine Goebel, Thomas Peter Plagemann, Torstein Dalen-
Lorentsen and Thor Einar Gjerstad Andersen for their guidance and support during the writing of this
thesis. I would also like to express my gratitude to Anders Larsen for retrieving the GPS data required for
this project.

Chapter 1

Introduction

1.1 Motivation

Injuries in professional football represent a major concern, as they negatively impact team performance
[15], and the process associated with player rehabilitation is often both costly and time-consuming.
Significant associations between low injury rates and increased performance have been shown in both
domestic and professional football teams [15]. This is particularly concerning as injuries among football
players are prevalent, and players are expected to incur around two injuries per season [9]. Prevention of
injuries is also of enormous importance for the development of individual players, as frequent injuries
may prevent players from achieving their maximum skill potential due to an absence from training and
competition [36].
Another point of concern is the economic impact of injuries, resulting from medical costs as well as
the missed earnings derived from a player's popularity [30]. This represents a financial burden upon
professional teams, with the average cost of an injury to a Premier League player being estimated at
£290,000.¹

¹ https://www.forbes.com/sites/bobbymcmahon/2019/08/22/report-shows-that-an-injury-to-a-premier-league-player-costs-on-average-350000/#5f22640410e2

With this in mind, it is not surprising that there is a growing interest in the field of injury forecasting among
researchers, managers and coaches. Despite numerous efforts, previous work in the evaluation of statistical
models for the prediction of injury has had little success. This may be attributed to a combination of two
factors, namely the limited availability of data describing the activity of players, and the fact that existing
studies have been limited by single-dimension estimations [39].
Advances in technology have provided researchers with enhanced player tracking, giving them access to
accurate workload data for players [41], as well as a wealth of open machine learning libraries. This has
subsequently led to a shift in focus towards multi-parameter statistics [43].

1.2 Problem Statement

This thesis is inspired by recent research which has shown promising results in the prediction of injury
among football players using GPS data [39]. The intention behind this work is to conduct a similar study
using training and competition data from a professional football club in Norway. We propose a data mining
approach with a twofold purpose.
The first of these is to create a data warehouse which will serve as a unified representation of all the training
and competition data generated by the club. The purpose of the data warehouse is to support both the data
mining study presented in this thesis as well as future research in the area of football-related injury and
illness. The requirements specification for designing the data warehouse thus takes into consideration
not only the research needs of this thesis but includes the user needs of researchers at NIH. The aim for
the data warehouse is, therefore, to provide researchers with a unified multidimensional data store which
contains data relevant for the analysis of football-players’ workloads, injuries and illnesses, and which
supports a variety of online analytical processing (OLAP) operations.
The second purpose of this thesis is to conduct a data mining study which aims to predict the onset of
player injury. Following a data mining cycle outlined in Chapter 2, a data set is generated from summarised
workload and injury data. This data set is then used to create prediction models using classification
techniques such as decision trees, random forests, support vector machines and logistic regression.
For the purpose of clarity, the user-needs relating to the data warehouse cover a broader scope than that of
the data mining requirements in this thesis. They will be referred to throughout the thesis as "project goals",
and are specified in Chapter 4, which covers data warehouse modelling. The "thesis goals" specifically
refer to the data mining goals for this thesis and are covered in Chapter 5.

1.3 Approach

Given the project goals presented in Section 1.2, and data collected from three source systems during the
2019 season, the problem statement for this thesis is approached in two parts:

1. The creation of a data warehouse.

2. Data mining to predict future injury.

An overview of the approach taken in this thesis is presented in Figure 1.1. The first phase involves the
creation of a data warehouse to aid the analysis goals of the project. Starting with the source systems,
data is first explored to gain an understanding of the data available. This process involves tasks such
as identifying data formats, describing data metrics, and identifying potential issues which exist within
the data. Information gathered from the source data is then used in combination with user requirements
to construct a data warehouse. Following a traditional four-phase design approach, a warehouse is first
modelled before data is extracted, transformed and then loaded into the data warehouse. The final task in
this phase involves the creation of a materialised view, which provides summarised data necessary for the
next phase of the thesis.
On completion of the first phase, data mining is carried out following the first five phases of the framework
described in Chapter 2. With a primary goal of predicting future injury, clear objectives are first established
in the business understanding phase, before the data understanding phase is used to gain initial insights
into the data set. Based on these insights, a final data set is constructed in the data preparation phase, and
then used for the development of prediction models in the modelling phase. The second phase ends with an
evaluation of the prediction models’ performance and suggestions for improving the results of future work.


Figure 1.1: Two phase approach

1.4 Scope of this Thesis

Due to a combination of the broad subject area covered in this thesis, and the limited time frame available,
it is essential to provide a clear definition of the scope of this work. Section 1.2 presents two objectives for
this thesis, namely, the construction of a data warehouse and the prediction of future injury using data made
available from several source systems. The scope of the data warehouse is limited to creating a simple data
warehouse which can address the analysis needs of the data mining goals presented in this thesis, as well
as several additional analysis goals specified by researchers at NIH. This involves modelling a functional
data warehouse which provides access to data necessary for achieving these goals. Additionally, basic data
extraction, transformation and loading tasks are required to ensure that data is successfully loaded into the
warehouse in a format which is consistent with the warehouse modelled. As a result of the 2019 season
terminating in December, and the generation of source files being a time-consuming endeavour, the vast
majority of the data used in this thesis was only made available at a late stage of this work. Little time is
thus available for extensive testing and performance enhancement, and for this reason, these topics are not
covered in great detail.
Furthermore, the data used in this thesis is limited to workload, injury and illness data gathered during a
single season. As much of the related work in this field is conducted using multiple seasons of data, the
work presented in this thesis serves as the groundwork for future studies. Lastly, this thesis makes use of
all data made available by the football club. Since the commencement of this project, several studies have
produced findings using data which was unavailable at the time this work was carried out. This is noted
and discussed in the section concerning future work.

1.5 Structure of this Thesis

The structure for this thesis is as follows:

• Chapter 2 provides a background of the three subject areas covered in this thesis. Starting with data
warehousing, Section 2.1 provides a definition of what a data warehouse is and how it differs from a
traditional database, before providing an overview of typical data warehouse functions. Next, the
concept of data mining is defined in Section 2.2, and the data mining framework used in this thesis
is presented. Lastly, a summary of related work in the field of injury prediction is given.

• Chapter 3 provides a description of the three data sources, before commenting on their quality.

• Chapter 4 covers the four stages of data warehouse modelling. Using an approach which takes
into consideration both user requirements and the source data available, this chapter provides an
account of the processes involved in designing, implementing and loading the data warehouse for
this project.

• In Chapter 5 the first five phases of the data mining framework used in this thesis are presented.
Using data extracted from the data warehouse, a feature set is created and used in a series of
four modelling approaches to assess the ability of different machine learning models to predict
future injury. A discussion of the results is then presented, highlighting several of the limitations
encountered in this work.

• Finally, Chapter 6 summarises the work presented in this thesis and provides direction for the future
work of this project.

Chapter 2

Background

2.1 Data Warehousing

As discussed in the previous chapter, it is the goal of both the present study and future studies to be able
to gain insights into the relationship between training workloads and injury, illness and performance.
As a means of achieving this goal, a data warehouse is proposed as an architecture for integrating the
organisations data and supporting structured queries, analysis and decision making. The motivation for
such an architecture arises from the organisations need for a semantically consistent data store which can
serve a variety of research studies. Furthermore, a data warehouse provides online analytical processing
(OLAP) tools for the analysis of multidimensional data, which facilitates effective data generalisation and
data mining [17].
This section begins with a presentation of the key motivating factors for choosing a data warehouse before
a brief overview of several core data warehouse concepts is provided. As Chapter 4 provides a more
detailed account of many of these concepts, only a brief overview is presented in this section.

2.1.1 Motivation for a Data Warehouse

The choice of a data warehouse instead of a traditional database is motivated by three factors. The first
of these is associated with the project’s user needs, which primarily focus on analysis tasks. Operational
databases, or online transaction processing (OLTP) systems, are tuned to support the daily operations of an
organisation. They are transaction-oriented, and their primary concern is to ensure fast, concurrent access to data.
Due to OLTP systems having to support heavy transactional loads, their design is focused on preventing
update anomalies, thus resulting in databases which are highly normalised. They, therefore, perform poorly
when aggregating large volumes of data, or executing complex queries which join many relational tables
[46]. Data warehouses, however, are designed to support online analytical processing (OLAP), which
focus on analytical queries and are designed to support heavy query loads.
The second motivating factor is that of data granularity and data aggregation. As is discussed in Chapter 3,
much of the project’s data is extracted at a very fine level of detail. This provides users with the ability
to analyse data at multiple levels of granularity. Despite OLTP and OLAP systems both providing users
with this functionality, OLAP systems are far better suited to aggregating large volumes of data. The
final motivating factor is the update frequency of the database. OLTP systems store and process data in
real-time, involving operations which regularly insert, update and delete data. The research conducted in
this project, however, involves analysing data collected over an entire season and is not dependent upon
the regular update operations which are typical of OLTP systems. For this reason, the project's database can
be updated relatively infrequently without affecting the success of the project. Lower update frequencies
are common in OLAP systems, which focus primarily on read operations and typically make use of batch
updates.
Given the analytical needs of users, the fine granularity of the source data, and the infrequency at which
the database needs to be updated, a data warehouse is considered an ideal architecture for integrating the
organisation’s data and supporting the project’s research.

2.1.2 Overview of Data Warehousing

A data warehouse can be defined as a repository of integrated data obtained from several sources for the
specific purpose of multi-dimensional data analysis [46]. The data collected in data warehouses share
several characteristics which are worth mentioning. The first of these is that data are subject-oriented,
meaning that the warehouse focuses on the analytical needs of the organisation. In the case of this
project, the analysis focuses on workload, injury and illness features of the source data. Another important
characteristic is that of non-volatility, meaning that the data is not modified or removed, thus ensuring its
durability. A final characteristic of importance is the historical nature of the data. As the long-term goal
of this project is to record and analyse data over a period of multiple seasons, the data
warehouse must have the ability to store historical workload and injury records.
Data warehouses use a multi-dimensional model to view data in an n-dimensional space, often referred to
as a cube or a hypercube. Cubes are comprised of two primary components, a fact, and a dimension. A
fact is the central component in the structure and is associated with a numeric value, known as a measure.
A measure is given perspective with the help of dimensions. Dimensions thus represent the granularity
of the measure for each dimension of the cube. The level of granularity specified for each dimension is
known as the dimension level [46].
To illustrate this concept, an example of a three-dimensional cube is provided in Figure 2.1. The cube
has three dimensions: Time, Player and Injury. Using the example of the Time dimension, data can be
aggregated at several levels. Examples of these are date, week, month, and season, and are referred to as
members. The members represented in the figure are thus: date, player and injury-type. The cells in the
figure represent a fact and are associated with numerical values called measures. In the case of the example
used here, the measure represents the distance covered by a player in kilometres. A fact typically includes
several measures, meaning that multiple workload features can be included in this thesis. Measures are
used for the task of quantitative analysis. From the figure, it can thus be seen that player 2 covered nine
kilometres on the sixth of June 2019 while the player was not injured. However, if the data were to be
aggregated at the season level, it might be seen that player 2 completed 1200 kilometres while not injured,
80 kilometres while carrying an overuse injury and 0 kilometres while being acutely injured.


Figure 2.1: An illustration of a three-dimensional data cube

The previous example of aggregating data at a higher level of granularity illustrates one of the several
operations available in online analytical processing (OLAP). Listed below are brief descriptions of the
primary operations which are used in this project.

1. Drill down disaggregates measures along an axis of the cube to obtain data at a finer granularity.

2. Roll up aggregates measures along an axis of the cube to obtain data at a coarser granularity.

3. Slice removes a dimension from the cube.

4. Dice filters cells according to a Boolean condition.

5. Add measure adds new measures to a cube. The new measures are calculated from other measures
or dimensions.
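
To make these operations more concrete, the following is a minimal sketch of roll-up, slice, dice and add measure over an invented, cube-like table of per-session workload records. It uses pandas purely for illustration; the column names and the choice of library are assumptions for the example and are not part of the project's actual OLAP tooling.

```python
import pandas as pd

# Invented cube-like data: one row per (date, player, injury type) cell,
# with distance in kilometres as the measure.
cube = pd.DataFrame({
    "date": pd.to_datetime(["2019-06-06", "2019-06-06", "2019-06-08", "2019-06-08"]),
    "player": [1, 2, 1, 2],
    "injury_type": ["none", "none", "none", "overuse"],
    "distance_km": [8.5, 9.0, 10.2, 1.5],
})

# Roll up: aggregate the measure from the date level to the week level
# (drill down would go the other way, back to the finer-grained data).
cube["week"] = cube["date"].dt.isocalendar().week
rolled_up = cube.groupby(["week", "player"], as_index=False)["distance_km"].sum()

# Slice: remove the injury dimension by fixing it to a single member.
sliced = cube[cube["injury_type"] == "none"]

# Dice: filter cells with a Boolean condition over several dimensions.
diced = cube[(cube["player"] == 2) & (cube["distance_km"] > 5)]

# Add measure: derive a new measure from an existing one.
cube["distance_m"] = cube["distance_km"] * 1000

print(rolled_up)
```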

As the data warehouse modelled in this project does not represent a collection of data from the entire
organisation, it is technically classified as a data mart. For the purpose of this study, however, it is referred
to as a data warehouse.

2.2 Data Mining

Data mining is an interdisciplinary subject that has several definitions. In industry and the research
milieu, data mining is often seen as being synonymous with the process of knowledge discovery from
data, or KDD. This process is comprised of a sequence of steps which aim to extract interesting patterns
representing knowledge from data. In other cases, data mining is defined as just one of the multiple steps
involved in this process of knowledge discovery. This thesis adopts a broad view of the data mining term
and defines it as the process of discovering interesting patterns and knowledge from large quantities of
data [17].


2.2.1 The CRISP-DM Process Model

The Cross-Industry Standard Process for Data Mining (CRISP-DM) was first introduced in 2000 and has since become a
popular framework for a wide variety of data mining projects. The CRISP-DM reference model is shown
in Figure 2.2 and provides an overview of the life cycle of a data mining project. The life cycle is divided
into six phases. These are connected by arrows which highlight the respective dependencies between
them. The order in which the phases are carried out is not strict, and the outcome of a particular phase will
determine which phase should be carried out next [50].

Figure 2.2: The CRISP-DM data mining life cycle

Each phase of the data mining cycle is outlined as follows:

1. Business Understanding
The initial phase of the cycle involves understanding the project objectives and requirements. This
knowledge is used to define the data mining problem [50].

2. Data Understanding
This phase involves the collection of data from its sources as well as the process of becoming
familiar with the data. The latter usually includes activities such as the identification of data quality
problems, the discovery of first insights into the data, and the detection of interesting subsets of data
[50].

3. Data Preparation
Data preparation encompasses all the activities which are required to construct the final data set
that will be used by the modelling tools. The process involves tasks such as attribute selection, data
cleaning, attribute construction, and data transformation [50].

4. Modelling
This phase involves the selection of modelling techniques, the generation of a test design, model
building, and model assessment [50].

5. Evaluation
In the evaluation phase, the constructed model(s) are evaluated to ascertain whether they achieve the
business objectives [50].

6. Deployment
In the final phase, the knowledge gained from the data mining is organised and presented in a way
which can be used by the organisation [50].

2.3 Machine Learning in Football

As a result of injuries representing a financial burden to football clubs and having a negative impact on
team performance [15], there has been a great deal of research in the field of injury prediction. Moreover,
much of this research focuses on the relationship between athlete workloads and injury risk [7]. Athlete
workloads are seen as being potentially useful in injury prediction as they fall under the classification
of modifiable risk factors. The multifactorial nature of injury prediction poses a significant challenge
as many of the risk factors associated with injury are non-modifiable [7]. These include factors such as
age, sex, sport, injury history, and level of competition [34; 21]. One of the drawbacks of non-modifiable
factors is that they do not provide coaches and medical staff with tools to prevent injury. For this reason,
modifiable factors which are associated with injuries are more useful for predictive modelling, as they
provide coaches and medical staff with intervention points which can be used to reduce the risk of injury [7].

2.3.1 Workloads in Predictive Modelling

Athlete workloads are considered potentially useful in predictive injury modelling for the reason that
they are modifiable, and that they are associated with a risk of injury. Workloads are defined as the
cumulative amount of stress placed on an individual from one or more sessions over a period of time,
and can be measured along several dimensions [7]. These include external/internal, subjective/objective,
and absolute/relative loads. External load is defined as the work completed by an athlete, measured
independently of the athlete’s internal characteristics [48]. Measuring the external load typically involves
quantifying the load of an athlete. It may include measures such as hours of training, distance run, distance
run above a specified speed, or number of games played [42]. An advantage of using external loads in
predictive modelling is that they can be easily measured using technological devices which are relatively
inexpensive and of little inconvenience to athletes. The internal load is a means of quantifying an athlete's
response to external loads and is a measure of the relative physiological or psychological stress imposed
on an athlete [16]. Internal loads give rise to the second dimension of measurement, which classifies a load
as being either subjective or objective. Subjective measures are self-reported and may include measures
such as ratings of perceived exertion (RPE) or questionnaires on well-being [7]. Objective measures include
a wide variety of physiological responses to workloads and range from measuring an athlete's heart rate or
blood lactate concentration, to conducting biochemical, hormonal and immunological assessments. There
is, however, limited research on the use of biochemical, hormonal and immunological measures due to
their collection being a costly and time-consuming process [16]. The third dimension of measurement
takes into consideration the absolute and relative workloads of athletes. Absolute workloads are simply
a summation of an athlete's workload over a specified period [7], whereas relative workloads take into
account the history of loading or the fitness of an athlete [42]. Relative workloads include measures
which account for variations in workloads, such as the mean standard deviation ratio (MSWR). They also
include measures which compare workloads over two time periods, such as the acute:chronic workload
ratio (ACWR), and measures which consider the moving average of an athlete's workloads, such as the
exponentially weighted moving average (EWMA) [35; 49; 5; 39].
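
As a concrete illustration, the sketch below derives EWMA, ACWR and MSWR style features from a single player's daily total-distance series using pandas. The window lengths, the span parameter and the interpretation of MSWR as a rolling mean-to-standard-deviation ratio are assumptions made for this example, not the exact definitions used later in this thesis.

```python
import pandas as pd

# Hypothetical daily workload series for one player (total distance in metres).
daily = pd.Series(
    [5200, 6100, 0, 4800, 7300, 0, 5900, 6400, 5100, 0, 6800, 7000],
    index=pd.date_range("2019-03-01", periods=12, freq="D"),
)

# EWMA: exponentially weighted moving average (span is an assumed parameter).
ewma = daily.ewm(span=7, adjust=False).mean()

# ACWR: acute (7-day) workload divided by chronic (28-day) workload.
acute = daily.rolling(window=7, min_periods=1).mean()
chronic = daily.rolling(window=28, min_periods=1).mean()
acwr = acute / chronic

# MSWR: rolling mean divided by rolling standard deviation (monotony-style ratio).
mswr = daily.rolling(window=7, min_periods=2).mean() / daily.rolling(window=7, min_periods=2).std()

print(pd.DataFrame({"load": daily, "ewma": ewma, "acwr": acwr, "mswr": mswr}).round(2))
```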

2.3.2 A Machine Learning Approach

In recent years, there has been a dramatic increase in the use of machine learning to gain insight into the
nature of injury and performance in professional sports [6]. This trend aims to exploit complex
patterns underlying available data and has been sparked by several factors. Firstly, traditional statistical
techniques used in the past fail to account for the complex non-linear relationships which exist between
predictor variables [39]. Secondly, the use of Electronic Performance and Tracking Systems (EPTS)
provide researchers with a wealth of data depicting multiple aspects of an athletes movements. Lastly, data
science is a rapidly emerging area that is providing more evidence-based decision-making across many
industries [24].
Predictive modelling in football aims to provide coaches and trainers with practical, usable and interpretable
models to aid decision making [28]. With this in mind, the task of predicting injury provides numerous
challenges. One of these is the trade-off between a model’s accuracy and interpretability [39]. Predictive
models should be interpretable for the reason that researchers and coaching staff need to know why an
athlete may be at risk. This enables one to change training strategies to avoid high-risk situations from
occurring. For this reason "black box" approaches such as neural networks are considered impractical [39].
On the other hand, it is of vital importance that predictive models are highly accurate, as "false alarms"
can negatively impact training strategies and athlete performance. Recent research thus makes use of a
variety of modelling techniques with neural networks, decision trees and support vector machines being
the most popular in the past five years [6].
Another challenge in predicting injury is that of the severe class imbalance between classes of injury and
non-injury [5]. Strategies such as oversampling, undersampling, and synthetic data generation have been
used to correct these imbalances [5; 39]. However, little is known about the volume of data needed to build
accurate predictive models [5].
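
To illustrate one such strategy, the sketch below applies ADASYN oversampling from the imbalanced-learn library to an invented, heavily imbalanced data set before fitting a random forest. The data, feature count and parameters are placeholders for illustration and do not reflect the pipeline used later in this thesis.

```python
import numpy as np
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Invented data: six workload-like features, roughly 5% injury (1) vs 95% non-injury (0).
X = rng.normal(size=(1000, 6))
y = (rng.random(1000) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Oversample only the training data so the test set keeps its natural class imbalance.
X_res, y_res = ADASYN(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
print("minority share after ADASYN:", round(float(y_res.mean()), 2))
print("test accuracy:", round(float(clf.score(X_test, y_test)), 2))
```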


2.4 Related Work

Despite the recent move towards using machine learning for injury and performance prediction, there
is still limited work which directly focuses on multivariate relationships between athlete workloads and
injury risk [6]. This is surprising given the extensive body of univariate research that has been conducted
in this area. Gabbett et al. have found that rugby players are at a higher injury risk when their workloads
are above certain thresholds[13; 12; 11]. Ehrmann et al. found that soccer players cover a greater distance
per minute in the weeks preceding an injury compared to their seasonal averages [8]. Anderson et al.
found a strong correlation between the monotony of basketball players’ workloads and the risk of injury[1].
Similarly, Brink et al. observed that injured young soccer players had higher values of monotony in the
weeks preceding an injury than non-injured players [4].
Two machine learning studies, however, are identified at the time of writing this thesis and are considered
particularly relevant to the work presented here. In a three-year study which uses GPS training loads and
perceived exertion ratings to predict injury in elite Australian football players, Carey et al. [5] conclude
that prediction models show poor ability to predict injuries. In another one-year study which uses GPS
workloads and a previous injury feature to predict future injuries in professional Italian football players,
Rossi et al. [39] conclude that their prediction model shows a good trade-off between accuracy and
interpretability. Given the high degree of similarity between the approaches taken in these studies, it is
surprising that they produce conflicting results. As each of the studies is conducted using data gathered
from a single football team, it should be expected that the study with the larger data set would produce
models with a superior ability to predict incidents of injury. This is, however, not the case, and leads to
several questions concerning the models’ ability to generalise. The discrepancy in results highlighted here
is one of the motivating factors for the work conducted in this thesis, which aims to predict future injury in
football players from a single season of workload and injury data. Moreover, this study aims to conclude
whether it is possible to predict injury using workload and injury data collected from a single season.
In an effort to reproduce much of the work that is conducted in the two studies above, a brief account of
their similarities and differences is presented. Starting with the data sets, both studies create models using
data which include a combination of absolute and relative workloads. The absolute workloads common to
both studies are those of distance and high speed running distance. Similarly, both studies make use of
relative workload features, EWMA, ACWR, and MSWR for many of the absolute features. Additionally,
Carey et al. include the relative features, strain and rolling averages, whereas Rossi et al. introduce a
multivariate previous-injury feature. Further similarities between the studies include the use of logistic
regression and random forest prediction algorithms. Rossi et al. additionally make use of decision trees,
whereas Carey et al. include general estimating equations and support vector machines. Both studies
make efforts to compensate for class imbalance. The Rossi et al. approach makes use of synthetic data
generation using ADASYN. Carey et al. use both undersampling and oversampling techniques. Finally,
dimensionality reduction is also used in both studies to compensate for the highly correlated feature sets.
Rossi et al. use recursive feature elimination to eliminate irrelevant features and enhance interpretability.
Carey et al. however, use principal component analysis to reduce the dimensionality of the data set.
One additional difference between the two studies is worth mentioning. Rossi et al. maintain that only a
subset of three features is needed to predict injury successfully. These include the EWMA of a player's
high speed running distance, the MSWR of a player's distance, and the EWMA of the previous-injury feature.
The first two features are included in the data sets of both studies, whereas the last feature is only included
in the Rossi et al. study. Given that this is shown to be the most significant prediction variable, it may
account for the difference in results achieved by the two studies.

Chapter 3

Source Data

The data for this project is collected from a professional Norwegian football team during the 2019 season
spanning from January to December. It is comprised of workload, injury, illness, and log data from three
sources, namely, a GPS tracker, a training log, and an injury log. Players included in the study underwent
training and competition without any interference from the research group. A total of 3005 GPS sessions
are recorded from 38 players. Additionally, 4726 log entries, 57 incidents of injury, and 37 cases of illness
are recorded during the season.
This chapter provides a detailed account of the three data sources mentioned above. Starting with the
GPS tracker, a description of each of the recorded features is given. A brief account of the injury, illness,
and log data is then provided before commenting on potential issues that need to be considered before
processing the data.

3.1 GPS Data

Players are monitored during both training and competition using the Catapult OptimEye X4 athlete
tracker, a device placed between the shoulder blades in a specially designed vest. The device is equipped
with a 10Hz GPS tracker, 3-axis 100Hz accelerometers, 3-axis gyroscopes, and 3-axis magnetometers,
which together generate a set of training features describing different aspects of a player’s workload. Of
the numerous features which can be extracted from the tracking system, six are chosen for this project.
Furthermore, the time interval for which features are extracted is specified to one second, meaning that all
training features are aggregated over one-second time intervals. Files are generated as comma-separated
values (CSV) files, and one file is generated for each session. A single file thus contains workload data
for all players that participated in the given session. Each line represents a single second of training and
contains workload data of a given player for the specified time interval.
Table 3.1 presents an overview of the features generated by the GPS unit. In addition to a player’s name,
each line of a CSV file also contains six workload recordings, a timestamp, and the interval number for the
session.


Interval: Interval number for the current session
Time: Interval start time
First Name: Player's first name
Last Name: Player's last name
Acceleration Load: Sum of acceleration values for a specified period
Player Load: Sum of the accelerations across all axes of the accelerometer
Total Distance: Distance covered in metres for the specified interval
V4 Distance: Distance covered at a speed of 20-25 km/h
V5 Distance: Distance covered at a speed in excess of 25 km/h
HSR Distance: The sum of V4 Distance and V5 Distance

Table 3.1: Features extracted from Catapult OptimEye X4

Two of the features presented above require further explanation:

Player Load
Player load is defined as the sum of the accelerations across all axes of the tri-axial accelerometer during
movement [26]. It serves as an indication of the amount of work done by a player over a specified time
interval, and can be expressed as follows:

$$\mathrm{PlayerLoad} = \sum_{t=0}^{n} \sqrt{(fwd_{t+1} - fwd_{t})^{2} + (side_{t+1} - side_{t})^{2} + (up_{t+1} - up_{t})^{2}} \qquad (3.1)$$

where:
$fwd$, $side$, $up$ are the acceleration values in the forwards, sideways and upwards directions respectively
$t$ is time

Acceleration Load
Acceleration load is defined as the sum of accelerations over a specified period [25]. It serves as an
indication of acceleration volume or the total amount of speed change during a specified interval of
time. Acceleration-load is calculated from smoothed velocity data at 0.2-second intervals and treats both
accelerations and decelerations as positive values. By deriving its value from smoothed velocity data,
acceleration load values are less susceptible to noise than values produced from raw positional
data, thus providing a more realistic representation of changes in speed.
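
The following sketch shows, under simplifying assumptions, how the two measures above could be computed from raw samples: player load from tri-axial accelerometer readings (cf. Equation 3.1) and acceleration load from a smoothed velocity series sampled at 0.2-second intervals. The variable names, scaling and sampling rates are illustrative and may differ from the proprietary Catapult implementation.

```python
import numpy as np

def player_load(fwd, side, up):
    """Sum of the magnitudes of change in tri-axial acceleration
    between consecutive samples (cf. Equation 3.1)."""
    fwd, side, up = np.asarray(fwd), np.asarray(side), np.asarray(up)
    return np.sum(np.sqrt(np.diff(fwd) ** 2 + np.diff(side) ** 2 + np.diff(up) ** 2))

def acceleration_load(velocity, dt=0.2):
    """Sum of absolute changes in (smoothed) velocity, treating accelerations
    and decelerations alike; dt is the assumed sampling interval in seconds."""
    velocity = np.asarray(velocity)
    accel = np.diff(velocity) / dt  # change in speed per interval
    return np.sum(np.abs(accel))

# Invented example: 100 Hz accelerometer samples and 0.2 s smoothed velocity samples.
t_acc = np.arange(0, 1, 0.01)
fwd, side, up = np.sin(t_acc), 0.5 * np.cos(t_acc), 0.1 * np.sin(2 * t_acc)
speed = np.array([2.0, 2.4, 3.1, 2.8, 2.2])  # m/s, smoothed

print("player load:", round(player_load(fwd, side, up), 3))
print("acceleration load:", round(acceleration_load(speed), 3))
```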

3.2 Injury Data

Player injuries and illnesses are recorded throughout the study using a software called AthleteMonitoring.
Incidents are reported by the team’s medical staff, and each entry contains information describing the
nature and duration of an injury or an illness. Table 3.2 summarises the key variables recorded by the
application.


Id: Player id
Team Name: Name of the sports team
Type: Either injury or illness; an injury is either acute or overuse
Legacy Diagnosis: Diagnosis of the injury/illness
Diagnosis: Diagnosis of the injury/illness
Body Area: Body area affected by the injury/illness
Mechanism: Mechanism by which the injury occurred
Classification: Either new or pre-existing
Date of Injury: Date of occurrence
Recovery Date: Date of recovery
Missed days: Number of missed training/competition days
Severity: Classified as slight, minimal, mild, moderate or severe
Activity: Activity during which the injury occurred
Participation: Level of training/match participation during the injury period

Table 3.2: Features recorded for player injuries and player illnesses

3.3 Training Log Data

The third source of data is also captured using the AthleteMonitoring software and involves a self-evaluation
that is completed by players after each session. Evaluations serve as an indication of how closely players
have followed their planned training sessions as well as an indication of their perceived level of enjoyment.
The features extracted from the evaluation are summarised in Table 3.3.

Activity: Type of training performed
Planned duration: The planned duration of the training
Duration: The actual duration of the training
Planned difficulty: The planned difficulty of the training
Difficulty: The perceived difficulty of the training
Planned load: The planned load of the training
Load: The calculated load of the training
Planned enjoyment: The planned enjoyment of the training
Enjoyment: The player's perceived enjoyment

Table 3.3: The features recorded by a player after the completion of each training session

3.4 First Impressions

Previous work has identified three training/injury features associated with the risk of injury, namely, the
exponentially weighted moving average of a player's high speed running distance, the mean standard deviation
ratio of a player's total distance, and the exponentially weighted moving average of a constructed injury feature
[39]. The raw data necessary for calculating the features mentioned above are total distance, high speed
running distance, the number of injuries incurred by a player, and the number of days since returning
to training after a previous injury. All of these can be calculated from the available source data. Both
distance and high speed running distance are provided by the GPS unit and can be aggregated to the level
of granularity specified. The number of injuries received by a given player can be calculated by summing
the player’s injury records in Table 3.2, and the number of days since a player’s return from injury can be
obtained by calculating the difference between a player's recovery date and the date of the current session.
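
As a brief illustration, the sketch below computes two such derived values, the number of previous injuries and the days since the most recent return from injury, from a toy injury table whose column names loosely mirror Table 3.2. It is an assumption-laden example, not the ETL code used in this project.

```python
import pandas as pd

# Invented injury records loosely mirroring Table 3.2.
injuries = pd.DataFrame({
    "player_id": [7, 7, 12],
    "date_of_injury": pd.to_datetime(["2019-03-02", "2019-06-15", "2019-04-20"]),
    "recovery_date": pd.to_datetime(["2019-03-20", "2019-07-01", "2019-05-05"]),
})

def injury_features(player_id, session_date):
    """Number of previous injuries and days since the last return from injury."""
    session_date = pd.Timestamp(session_date)
    prev = injuries[(injuries.player_id == player_id)
                    & (injuries.recovery_date <= session_date)]
    n_previous = len(prev)
    days_since_return = (
        (session_date - prev.recovery_date.max()).days if n_previous else None
    )
    return n_previous, days_since_return

print(injury_features(7, "2019-08-10"))  # -> (2, 40)
```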
Additional data may be required for research beyond the scope of this thesis. This includes data such as
player age, weight, height, and position, as well as possible performance indicators. The topic of additional
data is covered in further detail in the section on data warehouse modelling.
All of the data sources mentioned above are generated as CSV files, each of which has several issues that
need to be considered during the data cleaning process. These issues are briefly highlighted below.

GPS data:

• Duplication of data. Files may include duplicate lines of data. Multiple instances of duplication exist
in the provided source data and must be resolved before loading the data into the data warehouse.

• Missing and wrongly ordered parameters. The generation of the CSV files is a manual task which
is carried out for each session and involves specifying which workload features are to be included.
Human errors result from the mundane nature of the job. Errors such as missing parameters and
wrongly ordered parameters are present among the files used in this thesis.

• Non-training data. Another human error encountered in the data set is that which results from
players forgetting to turn off their GPS tracking units. In these instances, multiple lines of data are
generated outside of a session, resulting in lines of non-training data, which may easily outnumber
the lines of actual training data.

• Players registered with multiple profiles. As a result of difficulties encountered with the training
software, multiple profiles have been created for specific players. A player’s name is used to match
training data with data gathered from the other two sources. For this reason, it is essential to resolve
naming conflicts to guarantee that training data maps to the correct log and injury/illness data.

• Players without data. Due to a shortage of GPS trackers, there are two players with no recorded
training data. Data from the other two sources does, however, exist for these players, which must be
taken into consideration when loading data into the data warehouse.

Log data:

• Data spread over multiple lines. Numerous data entries are spread over multiple lines. This needs to
be taken into consideration when reading data from a file.

• Players registered with multiple profiles. Some players have been registered with multiple profiles
and therefore have multiple profile-ids. Resolving such conflicts is of vital importance when mapping
log data to the correct data from other sources.


• Players without log data. A third point of consideration for the log data is that several players do
not have any training log records.

Injury/Illness data:

• Incorrect dates. There are multiple cases in which an injury’s recovery date is recorded as occurring
before the date of injury.

• Missing injury type. The type of injury incurred by a player is classified either as acute or overuse.
There is one entry in the source data which is not classified.

Chapter 4

Data Warehouse Modelling

There is no consensus on the phases that should be followed during the data warehouse design process,
and for this reason, there exists no standard methodological framework. In this thesis, which implements a
relatively simple data warehouse, the design process adopts a methodology proposed by A. Vaisman and
E. Zimanyi, which is both well documented and easy to understand [46]. The methodology is based on
the assumption that data warehouses are a specialised form of traditional databases that focus primarily
on analytical tasks. Their design thus follows the same phases of traditional database design, namely,
requirements specification, conceptual design, logical design, and physical design. An overview of the
design process is presented in Figure 4.1.

Figure 4.1: Phases in the data warehouse design process

Furthermore, the design process in this thesis is carried out using an analysis/source-driven approach,
which takes into consideration both the analysis needs of the users, as well as the data which are available
from underlying source systems. This is considered an optimal approach for the current project as it ensures
that users’ needs for future studies are met while ensuring that all available data are taken into consideration.
Users are thus provided with the choice of including data which may not have been considered during the
initial goal-setting phase.
A final point of clarification is that of the general method of design, for which there exist two alternatives,
namely, top-down and bottom-up design. Top-down design focuses on merging user interests from an
entire organisation, in order to create a data warehouse which is cross-functional in scope. In the case
of NIH, this would involve modelling a schema that would include training data from the organisation’s
entire scope of sports research. Such a schema may subsequently be tailored into several separate data
marts, each focusing on a specific area of research. This project, however, adopts a bottom-up design
method, focusing only on a sub-set of the organisation’s data. The resulting product is a data-mart that is
specifically tailored to aid analysis of the organisation’s football data.


4.1 Requirements Specification

As the earliest step in the design process, the requirements specification phase is responsible for
determining which data is needed and how it is organised. It is a crucial step in identifying the essential
elements of a multidimensional schema and has a significant impact on the future success of the data
warehouse’s ability to perform its required tasks. When using an analysis/source-driven approach, tasks
from the analysis-driven approach and the source-driven approach are combined and used in parallel. The
combination of these two approaches ensures an optimal design by taking into account both the analysis
needs of the user(s) as well as the data which is available for creating a multidimensional schema [46].
From Figure 4.2, it can be seen that the requirements specification phase begins with the identification of
the organisation’s users as well as the data sources available. As it is the intention of this thesis to serve a
variety of research needs, all users must be identified to ensure that these needs are met. The next step
involves defining the project’s goals and applying a derivation process to the source data. This step results
in the initial identification of facts, measures, and dimensions. Finally, the information gathered from the
previous two steps is used to produce a separate requirements specification for each of the two approaches.

Figure 4.2: Steps in the requirements specification phase using an analysis/source-driven approach

The steps of the analysis-driven and source-driven approaches are described in detail below, starting
with the analysis-driven approach:

4.1.1 Analysis Driven Approach

Identify users
Two primary user groups are identified for this project. The first of these are researchers at NIH who
are interested in gaining insights into the relationships between training workloads, injury, illness, and
performance. The second group includes computer scientists at UIO who are interested in creating injury
prediction models based on player workload and injury data.

Analysis needs
This step begins with the identification of project goals, which plays an essential role in converting user
needs into data elements. From the two user groups identified above, three project goals are formulated
and listed as follows:

• To identify associations between workload features and player injury.

• To predict the onset of injury using workload features identified in previous studies.

• To identify associations between player biometrics, workload features and player injury/illness.

The goals established above are broken down into a set of sub-goals to identify both common and unique
elements among the primary goals. The sub-goals generated for this project are listed below:

1. To be able to extract a player’s GPS training features at multiple granularities.

2. To be able to extract injury and illness features for each player.

3. To be able to extract a player’s biometrics.

The next task in defining user analysis needs is to operationalise the above sub-goals by defining several
representative natural language queries. These aim to capture the functional requirements of the data
warehouse, and are listed below with respect to the goal they are derived from:

1. To be able to extract a player’s GPS training features at multiple granularities.

(a) The total distance, player load, acceleration load, V4 distance, V5 distance and HSR distance
covered by a player during a given session, week, month, season.
(b) Compare the workload features for a given time period (e.g. one minute) with the average of
that feature over another time period (e.g. a session).

2. To be able to extract injury and illness features for each player.

(a) The duration of an injury/illness.


(b) The number of days since recovering from an injury/illness.
(c) The number of injuries/illnesses for a given player.
(d) The number of days missed as a result of an injury or an illness.
(e) To be able to distinguish between acute and overuse injuries.

3. To be able to extract player biometrics.

(a) The age, weight, and height of a player.

The final task in determining the analysis needs involves breaking down the natural language queries in
order to define the facts, measures, and dimensions of the warehouse. As this is a manual process, a brief
discussion of each set of queries is given below. This includes information about which elements need to
be aggregated as well as the aggregation functions used.


1. (a) The total distance, player load, acceleration load, V4 distance, V5 distance, and HSR distance
covered by a player during a given session, week, month, season.

From the above query, a fact and two dimensions are identified. The player element is
categorised as a dimension. In contrast, the elements total distance, player load, acceleration
load, V4 distance, V5 distance, and HSR distance are identified as measures, which together
represent a fact. The query also highlights four levels of aggregation, namely, session, week,
month, and season, which naturally form the levels of a time hierarchy. Hence a second
dimension for representing time is identified. Finally, the totals for each of the measures can
be calculated using addition, and the aggregation function is specified as Sum.

(b) Compare the workload features for a given time period (e.g. one minute) with the average of
that feature over another time period (e.g. a session).

No additional facts or dimensions are identified from the above query. However, the need for
finer levels of granularity is highlighted. These levels are identified as second and minute,
and form a natural extension of the time hierarchy identified above.

2. (a) The duration of an injury/illness.


(b) The number of days since recovering from an injury/illness.
(c) The number of injuries/illnesses for a given player.
(d) The number of days missed as a result of an injury or an illness.
(e) To be able to distinguish between acute and overuse injuries.

Two additional dimensions, namely, injury and illness, can be identified from the list of
queries presented above. In the case of injury, a hierarchy, which includes the level injury-type,
is also identified. The type-level enables users to distinguish between incidents of acute and
overuse injuries. Furthermore, an additional level for the time dimension is identified. The
level identified is the date-level, and is considered essential for calculating the number of days
since a player’s last injury. No hierarchy is identified for the illness dimension. This is largely
a result of the nature of the research being conducted, which does not require
a detailed classification hierarchy for medical illness.

3. (a) The age, weight, and height of a player.

The above query does not aid in the identification of any additional facts or dimensions.
The query is, however, important in the sense that the information required to answer it is not
directly available from the source data. This topic is dealt with later in the modelling process.


Document requirements specification


The requirements specification of the analysis-driven approach ends with the documentation of information
gathered during the previous two steps. Information is summarised in the form of two tables, one table for
facts and measures, and one table for dimensions and hierarchies.
Table 4.1 provides a summary of the identified fact and its measures. Measures identified during the
previous step, as well as their aggregation functions, are included. Additionally, an indication of which
queries apply to each measure is provided. It can be seen that all measures apply to queries 1a and 1b,
which focus on the granularity of the data extracted. Similarly, queries 2a–3a do not concern any of the
measures in the table; this is illustrated by marking the relevant cells in the table with a cross. These
queries are not applicable because they focus on extracting information specific to the player, injury,
and illness dimensions.

                                                       Analysis scenarios
Fact            Measure          Aggregation function  1a  1b  2a  2b  2c  2d  2e  3a
Training data   Total Distance   Sum                   ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data   Player Load      Sum                   ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data   Acc Load         Sum                   ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data   V4 Distance      Sum                   ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data   V5 Distance      Sum                   ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗
Training data   HSR Distance     Sum                   ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✗

Table 4.1: Fact summary

Table 4.2 provides an overview of the dimensions. From the table, it can be seen that four dimensions are
identified, namely, the Date-Time, Injury, Illness, and Player dimensions. Both the Date-Time and Injury
dimensions include hierarchies, for which seven and two levels are seen respectively. The Time hierarchy
provides users with the ability to extract data at multiple levels of aggregation ranging from one-second
intervals at the finest level, to an entire season’s data at the coarsest level. The Type level seen in the
Classification hierarchy allows users to differentiate between acute and overuse injuries. No hierarchies
are seen for the Illness and Player dimensions. As in Table 4.1, Table 4.2 also indicates which queries
apply to each dimension.

                                                               Analysis scenarios
Dimension   Hierarchies and levels                             1a  1b  2a  2b  2c  2d  2e  3a
Date-Time   Time: Second → Minute → Session → Date →          ✓   ✓   ✓   ✓   ✗   ✓   ✗   ✗
            Week → Month → Season
Injury      Classification: Injury → Type                     ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✗
Illness     -                                                  ✗   ✗   ✓   ✓   ✓   ✓   ✗   ✗
Player      -                                                  ✓   ✓   ✓   ✓   ✓   ✓   ✗   ✓

Table 4.2: Dimension summary


4.1.2 Source-Driven Approach

Once the requirements specification for the analysis-driven approach has been completed, the process is
then conducted using a source-driven approach. The steps involved are discussed in detail below.

Identify source systems


This step involves identifying the project’s data sources and determining their respective reliability,
availability, and update frequencies. As this process is covered in detail in the section on source data, it is not
repeated here.

Apply derivation process


The derivation process involves deriving facts, measures, and dimensions from the source data. Facts and
measures are usually associated with elements that are frequently updated. This makes the data gathered
from the GPS tracker a prime candidate to be a fact, and each numerical data feature a potential measure.
The training features, total distance, acceleration load, player load, V4 distance, V5 distance, and HSR
distance are thus classified as measures belonging to the fact "Training data".
Each GPS entry is also associated with a time and a player, both of which are potential dimensions.
Training data are associated with both a player and time through a many-to-one relationship. The former
results from one player being able to have many training entries, but each training entry only being
associated with one player. The latter is the result of many training sessions being able to take place at the
same time, whereas each recorded GPS entry may only occur at a single given time.
Injury data is another candidate dimension and is also associated with the GPS data through a many-to-one
relationship. This relationship arises from a single injury (or non-injury) entry being associated with many
GPS entries. The final set of source data available is the players’ training log. It provides another candidate
dimension and is associated with the training data in a one-to-many relationship, for the reason that one training log is
used to summarise multiple GPS entries.
The next step in the derivation process involves the analysis of potential dimensions and hierarchies. The
first task involves specifying the levels of granularity that need to be included in the Time dimension. The
GPS source data are summarised over one-second intervals and provide the finest level of granularity for
the data warehouse. Coarser levels are defined according to user needs, and for this thesis, it is essential to
be able to summarise data over both the level of a session and a season. One of the primary interests of this
thesis is to be able to analyse data at several different granularities. Logical candidates for representing
fine levels of granularity would thus include the levels minute and hour, whereas coarser levels would
include session, week, month, and season. The Time dimension resulting from the source-driven approach
thus gives rise to the following levels:

Second → Minute → Hour → Session → Date → Week → Month → Season

Although the scope of this project is initially limited to a single football team, the data warehouse
can easily be modelled to include additional teams. For this reason, a hierarchy is included in the study’s
Player dimension. The following levels are included:

Player → Team → Sport

On closer inspection of the injury source data, it becomes clear that several levels can be identified. The
hierarchy identified in this dimension falls under the classification of an unbalanced hierarchy and is
discussed in greater detail in the conceptual design phase. Starting with the coarsest level, an injury entry
can be classified as either being an injury or a non-injury. An injury can then be further classified as an
illness or a physical injury, and a physical injury is categorised as being either an overuse or an acute injury.
This information thus identifies a hierarchy consisting of the following levels:

Injury → Sub-type → Type → Category

Finally, no hierarchies are identified in the Training Log dimension.

Document requirements specification


As with the analysis-driven approach, the requirements specification of the source-driven approach ends
with the documentation of information gathered during the previous two steps. Table 4.3 provides a
summary of the information discussed above. One fact containing six measures is identified and is
associated with four dimensions through one-to-many relationships. Hierarchies are presented for the
dimensions, Date-Time, Injury, and Player, whereas no hierarchies are identified for the Training Log
dimension.

Facts           Measures            Dimension      Cardinality   Hierarchies and levels
Training data   Total Distance      Date-Time      1:n           Time: Second → Minute → Hour → Session →
                Acceleration Load                                Date → Week → Month → Season
                Player Load         Injury         1:n           Classification: Injury → Sub-type → Type →
                V4 Distance                                      Category
                V5 Distance         Player         1:n           Membership: Player → Club → Sport
                HSR Distance        Training Log   1:n           -
Table 4.3: Summary of the multidimensional elements using a source-driven approach

4.2 Conceptual Design

The purpose of the conceptual design schema is to present data requirements in a clear and concise manner
that can be understood by users. By avoiding implementation details, the conceptual schema facilitates
communication between users and designers, as well as the maintenance and evolution of a database. Due
to the lack of a universally adopted conceptual model for multidimensional data, data warehouse design
frequently omits this phase and usually carries out design directly at the logical level [46]. As mentioned
above, the conceptual modelling phase is incorporated into the design process of this project, making use
of the MultiDim model, which is able to represent all elements in the data warehouse. A graphical notation
is provided in the appendices.


Figure 4.3: Steps in the conceptual modelling phase using an analysis/source-driven approach

Following an analysis/source-driven approach, it can be seen from Figure 4.3 that the conceptual modelling
phase begins with the creation of two conceptual schemas, one from the analysis-driven approach, and the
other from the source driven approach. These schemas are then matched before the final schema is defined,
and mappings between the source data and the data warehouse are specified. Before addressing the steps
just mentioned, it is necessary to look at two of the conceptual design issues encountered in this thesis.
The first of these is classified as a many-to-many dimension and arises during the source-driven approach.
The second is known as an unbalanced hierarchy and is encountered during both the source-driven and
analysis-driven approaches.

Many-to-Many Dimensions and Nonstrict Hierarchies


One challenge which arises from the project’s data set is that of modelling the relationship between a
player’s workload data and incidents of injury and illness. The issue encountered here is known as a many-
to-many dimension and results from the fact that a player is able to be both injured and ill at the same time.
This concept is illustrated in Figure 4.4, and may lead to a problem known as double-counting. Double
counting occurs when a roll-up operation reaches a many-to-many relationship and results in a single
measure being included in multiple aggregation operations, thus breaking the condition of disjointness and
resulting in a hierarchy not being able to ensure summarisability. Double counting may arise as a result
of a schema not meeting the requirements of the first multidimensional normal form (1MNF). This rule
requires that each measure be uniquely identified by its set of associated leaf levels, and serves as a basis
for correct schema design.

Figure 4.4: Illustration of the many-to-many relationship between a fact and a dimension


As an example of the issue described above, Table 4.4 is included to highlight the problem of double
counting. The data represented in the table may result from selecting the distance covered during sessions
in which a player was registered as either injured or ill. Here it can be seen that player 1 covered a total
distance of 5000 meters during training session T1. As the player is registered as both injured and ill at
the time of training, two tuples are generated for a single session. If the data set were to be aggregated
over the time dimension, a total distance of 5000 + 5000 = 10 000 meters would be generated instead of the
expected value of 5000 meters.

Time   Player   Distance   Injury/Illness

T1     1        5000       Injury 1
T1     1        5000       Illness 2

Table 4.4: An example of double-counting in a many-to-many dimension.
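A minimal, self-contained SQL sketch reproducing this double count is shown below; the table and column
names are purely illustrative and are not part of the warehouse schema.

WITH session_workload(training_session, player, distance) AS (
    VALUES ('T1', 1, 5000)
),
injury_illness(player, event) AS (
    VALUES (1, 'Injury 1'), (1, 'Illness 2')
)
SELECT w.training_session, w.player, SUM(w.distance) AS total_distance
FROM session_workload w
JOIN injury_illness e ON e.player = w.player
GROUP BY w.training_session, w.player;
-- Returns 10000 for session T1: the 5000 m covered is counted once per matching event.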

One method for resolving this issue is to decompose the Injury/Illness dimension into two separate
dimensions, one for injury and another for illness. This concept is illustrated in Figure 4.5, where it can
be seen that training data is associated with both an Injury dimension and an Illness dimension through
a one-to-many relationship. More specifically, this relationship stipulates that a training tuple is related
to precisely one injury and precisely one illness. In contrast, an injury or an illness may relate to one or
more training tuples. This model thus satisfies the case in which a player may be both injured and ill at
the same time, while also meeting the requirements of 1MNF. As the schema specifies that each training
entry relates to precisely one injury and one illness, a non-injury and non-illness tuple is required to satisfy
this condition for training sessions in which a player is not ill or injured. Similarly, a non-training tuple is
required for every incident of injury or illness where a player was absent from training.

Figure 4.5: A decomposition of the injury/illness dimension

Unbalanced Hierarchies
Another issue that is encountered during the modelling of the Injury and Illness dimensions is that of
unbalanced hierarchies and arises when at least one level in a dimension’s hierarchy is not mandatory.
This results in parent members within the hierarchy not having any child members, as is seen for the
hierarchy of the Injury dimension in Figure 4.6. As described above, the Injury dimension is required to
include non-injuries in order to satisfy the one-to-many modelling constraint between itself and training
entries. This results in a non-injury member with no child members, and a hierarchy which does not satisfy
summarisability conditions. Using the injury hierarchy in Figure 4.6 as an example, it becomes clear that
when all measures in the fact table are associated with the "Type" level, aggregation of these measures into
the "Category" level is only possible for the injury member and not for the non-injury member. A similar
problem is encountered for the Illness dimension, and its solution is discussed in the section on logical
design.

Figure 4.6: Unbalanced hierarchy

Having presented some of the initial challenges associated with the modelling of the conceptual schema, a
brief account of each of the steps involved in generating the final conceptual schema is given.

4.2.1 The Initial Analysis-Driven Conceptual Schema

An initial conceptual schema is generated from the analysis-driven requirements specified in the
requirements phase and is presented in Figure 4.7. Given that the analysis for this project is centred around
training/competition data, a fact table called "Training data" is created and contains six measures, namely,
total distance, player load, acceleration load, V4 distance, V5 distance, and HSR distance. All of these can
be directly extracted from the GPS source data and are aggregated using the SUM operation. Each entry in
the fact table is associated with four dimensions through a one-to-many relationship, meaning that each
fact entry is associated with only one entry in each of the four dimensions. In contrast, a given entry in any
of the dimensions may be associated with multiple entries in the fact table. For example, each workload
data entry in the fact table is only associated with one player, whereas a single player is expected to have
multiple lines of recorded workload data.
The Date-Time dimension contains five levels of aggregation. There are a few points worth noting
with regard to this dimension. Firstly, the levels Second, Minute, and Date are represented in a single
level despite being identified as separate levels during the requirements specification phase. Secondly, the
Day level may include one or more sessions as a result of multiple sessions being able to take place on the
same day. Thirdly, the levels Week and Month are modelled as a parallel hierarchy for the reason that a
week may extend over two months, and thus cannot be aggregated directly into the Month level. Finally, a
Season is considered the highest level of aggregation for the Time hierarchy and is equivalent to a calendar
year running from 1 January to 31 December.
The Date-Time dimension is of particular interest in this study as data is extracted at one-second intervals
and provides a user with the opportunity to conduct analysis at a relatively high degree of precision.


Figure 4.7: The initial conceptual model using an analysis-driven approach

Including both dates and times in a data warehouse schema provides users with a vast range of granularities
at which data can be extracted. It also provides designers with several choices regarding how date and
time are to be modelled. These options are discussed in further detail in the logical design phase. The
most straightforward alternative is taken at the conceptual level as a means of facilitating communication
between users and designers.
The Injury and Illness dimensions are represented as separate dimensions, each including two levels of
aggregation, namely, Category and Type. Despite no hierarchy being identified during the requirements
specification phase, the conceptual model includes a hierarchy of two levels in the Illness dimension.
The Category level is similar for both injury and illness and is introduced to satisfy the one-to-many
relationship between the Illness and Injury dimensions and the fact table. Thus, each entry in the fact table
is classified as either a non-injury/illness or as an injury/illness. This relationship is expressed
using the exclusive relationship symbol seen in the figure. In the case of the Injury dimension, the Type
level is used to distinguish between overuse and acute injuries. The Type level in the Illness dimension is
included for the purpose of potential future classification of illnesses.
The final dimension included in the schema is that of the Player dimension, which includes biometric data
such as the age, weight, and height of a player. No hierarchy is included for this dimension.

4.2.2 The Initial Source-Driven Conceptual Schema

The initial conceptual schema generated from the project’s source data is presented in Figure 4.8. It is very
similar to the schema generated from user requirements and therefore only a brief discussion of their
differences is presented here.

Figure 4.8: The initial conceptual model using a source-driven approach


A Training Log dimension is included in the source-driven schema and contains features from the log
completed by the players after each session. It is related to the fact table through a one-to-many relationship,
meaning that each log entry may be associated with one or more fact entries. In contrast, each training
entry is only related to one log entry. This dimension does not include a hierarchy.
The Player dimension introduces a hierarchy that has two levels of aggregation, namely Club and Sport.
Furthermore, the player table includes player name features, whereas no biometric features for a player are
available here.

4.2.3 Conceptual Schema Matching

The process of matching the two initial conceptual schemas presented above is relatively straightforward
due to their similarity. The schemas’ respective fact tables, as well as their Date-Time, Injury, and Illness
dimensions, are identical. For this reason, they can be added to the final conceptual schema without any
changes being made. The Training Log dimension from the source-driven schema provides additional
information that may prove useful for several reasons, the most important being that it includes the
rating of perceived exertion (RPE) of a player for each session. This dimension can also be included in the
final conceptual schema without needing to make any changes. Finally, the Player dimensions from each
of the two initial schemas have several differences that must be resolved. Player biometrics represented
in the analysis-driven schema are not available from the source systems provided, and are therefore not
included in the source-driven schema. Biometric data can, however, be obtained from external sources,
and for this reason, are included in the Player dimension of the final conceptual schema. The hierarchy
presented in the source-driven schema is not useful for this thesis. However, it may prove useful for
future research if the project should include data from additional clubs and different sports. As it is of little
cost to include the two additional levels in the given hierarchy, they too are included in the final schema.

4.2.4 The Final Conceptual Schema and Mappings

A final conceptual schema is generated from the analysis-driven and source-driven schemas and is presented
in Figure 4.9. It consists of a single fact table comprising six measures and is associated with
five dimensions, namely, Date-Time, Injury, Illness, Player, and Training Log.
The Injury, Illness, and Date-Time dimensions are identical to those modelled in the analysis-driven and
source-driven schemas and are discussed in detail above. The most important points from this discussion
were that the date and time of a training entry are modelled in a single dimension to facilitate
communication between technical and non-technical users. Furthermore, the Injury and Illness dimensions both
contained an unbalanced hierarchy resulting in the existence of exclusive relationships between levels.
The dates and times of each training entry can be extracted from GPS source data. It can be seen from
Table 3.1 that GPS files include a timestamp for each entry, while the date of each training session can be extracted
from the file header. Similarly, injury and illness data can be extracted from the injury/illness file produced
by the AthleteMonitoring application. Injury and illness data need to be separated during the extraction,
transformation and load (ETL) process and loaded into their respective dimensions.
The Training Log dimension from the source-driven schema is included in the final schema. Despite this
data not being directly related to the current analysis needs of the project, it is included for the reason that
it may prove relevant for future work. Log data can be extracted from the training log file generated by the
AthleteMonitoring application. This file contains data for all log entries submitted during the study.


Figure 4.9: The final conceptual model generated from the analysis-driven and the source-driven models

Finally, the Player dimension is created as a combination of elements from the source-driven and analysis-
driven schemas and includes a hierarchy with two levels of aggregation. The first of these is the Club level,
which is limited to just one club-id feature that serves as a unique identifier for a given football club. The
decision to include this dimension is motivated by the organisation’s desire to expand the current study and
include training data from other football clubs. Similarly, the Sport level is included in the hierarchy and
contains a single attribute, namely sport-id. The player’s name is excluded from the Player dimension for
privacy reasons, and a unique player-id is adopted as a means of differentiating between players. Player
biometrics, age, weight, and height have been included as they are necessary for achieving one of the
project goals defined in the requirements specification. The biometric data required is not available among
the data sources and needs to be obtained from a source within the organisation.


4.3 Logical Design

The logical modelling process is illustrated in Figure 4.10 and is made up of two steps. The first of these
involves the generation of a logical schema from the previously defined conceptual schema. The second
involves the specification of an ETL process, which takes into consideration mappings and transformations
defined during the conceptual design phase.

Figure 4.10: Steps in the logical modelling process

There are several approaches to implementing the multidimensional model. These include relational OLAP
(ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP). This thesis adopts a ROLAP
approach, meaning that data is stored in relational databases and can be queried using SQL. A strong
motivating factor for this approach is the relative ease of implementation, as well as it being the preferred
approach for large amounts of data.
There are several ways in which data warehouses can be relationally implemented. A brief discussion of
the implementations relevant for this project is given before the logical schema is presented.
The first possible implementation is known as a star-schema and consists of a central fact table as well
as a single dimension table for each dimension. Dimension tables are typically not normalised and may
contain a lot of redundant data, which is consequently one of the drawbacks of this implementation. The
star-schema, however, offers two advantages that are of particular interest for this project. These are its ease
of understanding and its query performance. The relatively simple design of a star-schema facilitates easier
navigation and fewer join paths, resulting in less complex queries and improved performance. The second
implementation of relevance for this study is known as a snowflake schema and avoids data redundancy by
normalising dimension tables. Snowflake schemas have the advantage of easier data maintenance and
require less storage. However, as a result of the increased number of join paths, performance is affected, and
queries can be complex. The final implementation considered is the starflake schema, which is
a combination of the star and snowflake schemas, where some dimensions are normalised and others are not.

4.3.1 The Logical Schema

The relational implementation used in this thesis is that of a star schema consisting of one fact table and
six dimensions and is presented in Figure 4.11. In many cases, a logical schema can be generated using a
set of general mapping rules which directly transform a conceptual schema into a logical one. As these
rules do not capture the specific semantics of some of the hierarchies identified in the conceptual design
phase, the transformation process is carried out manually and discussed below.
The Date-Time dimension presents an interesting challenge due to the fine granularity of the data being
modelled. This project aims to support precise calculations down to the finest granularity, while at the same
time supporting a classic Date dimension. As a classic Date dimension may include attributes that cannot
be generated using SQL, there arise several alternatives with regard to the modelling of the Date-Time
dimension. One possibility is to include all levels of granularity in a single Date-Time dimension and
provide access to all of the date-time attributes through one join path. The problem which arises with this
approach is that the Date-Time dimension quickly becomes large and requires a lot of storage. Assuming
that a football team partakes in 200 days of training/competition a year and that the Date-Time dimension
excludes days of non-training, the dimension would contain 17 280 000 (200 × 86 400) entries for each
year of training. This number could be reduced by excluding all date-time entries which are not associated
with an entry in the fact table. By doing this, and assuming that each training session lasts one hour,
the Date-Time dimension can be drastically reduced to approximately 720 000 entries per team
per year of training. A second modelling approach is to split the Date-Time dimension into separate Date
and Time dimensions, and in so doing further reduce the number of entries to 365 for the Date dimension
and 86 400 for the Time dimension (assuming a 24-hour day is modelled). This approach, however, leads
to complex queries when calculating time-spans between dates. A third approach involves the introduction
of a date-time stamp attribute in the fact table as an addition to the implementation of separate Date and
Time dimensions [27]. This approach has the advantage of reducing the number of date and time entries,
as well as avoiding the issue of complex queries when performing precise calculations at fine granularities.
It can be seen in Figure 4.11 that this thesis adopts the approach of separating the Date-Time dimension
into two dimensions (Date and Session), and including a date-time stamp in the fact table. The Session
dimension has a session-id as its primary key and includes the attributes session number for the current
season and session type. Furthermore, both the Date and Session dimensions are denormalised. All
attributes are included in a single table to improve performance and to simplify querying data at different
granularities.
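The date-time stamp makes such time-span calculations straightforward. As a minimal sketch (column
names as in Figure 4.13), the duration of each session per player can be computed directly from the fact
table:

SELECT player_id, session_id, MAX(dt) - MIN(dt) AS session_duration
FROM training_data
GROUP BY player_id, session_id;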
The issue of unbalanced hierarchies in the Injury and Illness dimensions is presented in the conceptual
design phase. As these hierarchies do not satisfy summarisability conditions, a strategy must be
implemented to correct this issue. This thesis adopts the use of placeholders to transform unbalanced hierarchies
into balanced ones. It can be seen in Figure 4.12 that a placeholder is introduced at the type level for
non-injuries in the Injury dimension. One possible disadvantage of this solution is that the introduction of
meaningless values requires additional storage space.
This is, however, of little concern in this thesis as all non-injury and non-illness measures in the fact table
reference a single non-injury and a single non-illness tuple respectively. The additional space required for
the introduction of the "non-injury" and "non-illness" members is thus equivalent to the space required
for storing an extra 21 characters, which translates to an additional 23 bytes (each short value carries a
one-byte header) when using PostgreSQL’s
text datatype. The hierarchies in both the Injury and Illness dimensions are denormalised, resulting in
a single table for the Injury dimension and a single table for the Illness dimension. The motivation for
this decision is to increase performance. By including all attributes in a single table, the number of join
operations required to access the leaf level in the Injury or Illness dimension is reduced by two. This
serves to increase performance as well as simplify queries. Furthermore, due to the relative simplicity of
the medical records, little additional information is available for the levels Category and Type. For this
reason, the data can be represented in a single table without much duplication occurring.
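As a minimal sketch of the placeholder members described above, assuming the denormalised Injury and
Illness tables of Figure 4.11 have been created: the column names beyond those visible in Figure 4.13 are
assumptions, and the id value 0 follows the convention described in Section 4.4.2.

INSERT INTO injury  (injury_id, category, type)
VALUES (0, 'non-injury', 'non-injury');

INSERT INTO illness (illness_id, category, type)
VALUES (0, 'non-illness', 'non-illness');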
The Player dimension is also denormalised, and all attributes are included in a single table. As the Team
and Sport levels are incorporated into this thesis for possible future work, only a single unique identifier
is used for them. The fact that the data for this thesis is gathered from just 38 football players means
that these levels can be represented in a single table without requiring much extra storage capacity. The
removal of two join paths has the apparent benefit of improved performance as well as the advantage of
simpler querying.

Figure 4.11: A logical data warehouse schema for football data

The Training Log dimension is included in the logical schema for the reason that it may prove useful
for future work in the research project. All attributes from the conceptual schema are included and
represented in a single table with a unique log-id serving as the primary key for the table. Finally, the
fact table contains all measures presented in the conceptual schema as well as a foreign key for each of
the dimension tables, a date-time stamp, and a unique id. The inclusion of the date-time stamp serves to
simplify time-span calculations. The unique id is used as the primary key for the table.
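For illustration, a simplified DDL sketch of the fact table is given below. Column names follow Figures 4.11
and 4.13 where known; the data types, the surrogate-key definition and the names illness_id and log_id are
assumptions, and the foreign-key constraints to the dimension tables are omitted so that the snippet stands
alone.

CREATE TABLE training_data (
    id                BIGSERIAL PRIMARY KEY,  -- surrogate key for each one-second entry
    datekey           TEXT,        -- references the Date dimension (smart key)
    session_id        INTEGER,     -- references the Session dimension
    player_id         INTEGER,     -- references the Player dimension
    injury_id         INTEGER,     -- references the Injury dimension (0 = non-injury)
    illness_id        INTEGER,     -- references the Illness dimension (0 = non-illness)
    log_id            INTEGER,     -- references the Training Log dimension
    dt                TIMESTAMP,   -- date-time stamp used for time-span calculations
    total_distance    NUMERIC,
    player_load       NUMERIC,
    acceleration_load NUMERIC,
    v4_distance       NUMERIC,
    v5_distance       NUMERIC,
    hsr_distance      NUMERIC
);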

Figure 4.12: The use of placeholders to transform unbalanced hierarchies into balanced hierarchies

4.3.2 Definition of the ETL Process

The second step of the logical design phase is to describe the transformations required before source data
can be loaded into the data warehouse. This process builds upon the mappings identified in the conceptual
design phase.
Starting with the Date dimension, a script is used to generate the majority of attributes found in this
table. The exception to this rule is the "matchday indicator" attribute for which dates are gathered from
club personnel. As there are fewer than 365 days of recorded sessions per year, one consideration when
developing the Date dimension is whether to include an entire calendar year. For the reason that it does
not cost much in terms of storage, the Date dimension includes an entire calendar year for each year of
training/competition. Furthermore, this approach simplifies matters if additional football teams are included
and eliminates the process of identifying non-training days for multiple teams. Another consideration is
that of the table’s primary key for which several possibilities exist. Two common techniques involve the
use of a meaningless surrogate key or a readable number such as 08031983 to express the date 8 March
1983. The latter option is chosen for this thesis.
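A minimal sketch of how one calendar year of the Date dimension might be generated in PostgreSQL is
shown below. Apart from datekey and training_date (Figure 4.13), the attribute names and data types are
assumptions.

CREATE TABLE IF NOT EXISTS date_dimension (
    datekey            TEXT PRIMARY KEY,      -- smart key in DDMMYYYY form, e.g. '08032020'
    training_date      DATE NOT NULL,
    week               INTEGER,
    month              INTEGER,
    season             INTEGER,
    matchday_indicator BOOLEAN DEFAULT FALSE  -- set afterwards from dates supplied by club personnel
);

INSERT INTO date_dimension (datekey, training_date, week, month, season)
SELECT to_char(d, 'DDMMYYYY'),
       d::date,
       EXTRACT(week  FROM d)::int,
       EXTRACT(month FROM d)::int,
       EXTRACT(year  FROM d)::int
FROM generate_series('2020-01-01'::date, '2020-12-31'::date, interval '1 day') AS d;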
The Session dimension is generated from an ordered text file containing the dates and times of all sessions.
The table’s primary key is a number corresponding to the session’s position in the text file. The value of
the primary key for the first session is thus one and is incremented by a value of one for every additional
session added. As a result of the text file being ordered, the primary key of an entry in the Session table
corresponds to the number of that particular session relative to the first session entry ever loaded. The
"Session Number Season" attribute is calculated during loading with the use of a counter which resets
every time the year is incremented. Lastly, the data needed to load the session’s "Type" attribute is gathered
from club personnel.
Both the Injury and Illness dimensions are loaded from a CSV file generated by the AthleteMonitoring
application. Their loading process requires that entries for injuries and illnesses be separated and loaded
into their respective tables. Additionally, text attributes must be converted to the corresponding data type
of each relational attribute. No additional data is required for these dimensions, but a non-injury and a
non-illness entry must be generated for the respective tables.
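Assuming the AthleteMonitoring export has first been loaded into a staging table, the separation could be
expressed as sketched below. The staging layout, including the record_type column and its values, is an
assumption about the CSV file; the injury columns follow Figure 4.13, and the illness columns are assumed
analogous.

CREATE TEMP TABLE staging_medical (
    player_id     INTEGER,
    record_type   TEXT,      -- assumed to distinguish 'Injury' from 'Illness' rows
    type          TEXT,      -- e.g. acute or overuse for injuries
    start_date    DATE,
    recovery_date DATE,
    days_missed   INTEGER
);
-- ... populate staging_medical from the exported CSV file ...

INSERT INTO injury (type, injury_date, recovery_date, days_missed)
SELECT type, start_date, recovery_date, days_missed
FROM staging_medical
WHERE record_type = 'Injury';

INSERT INTO illness (type, illness_date, recovery_date, days_missed)
SELECT type, start_date, recovery_date, days_missed
FROM staging_medical
WHERE record_type = 'Illness';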
Data required for the Player dimension is collected from several sources. Firstly, the player-id attribute
is extracted from the AthleteMonitoring application, which randomly assigns player-ids on creation of
an athlete profile. As several players are not registered on this system, additional player-ids must be
created for these players. Similarly, ids for both the team-id and sport-id attributes must be created for this
dimension. The remaining attributes, age, height, and weight, are gathered from team personnel.
Data for the Training Log dimension is loaded from a text file generated by the AthleteMonitoring
application. Text attributes need to be converted to the corresponding data type of each attribute in the
Training Log table. A non-training log entry must also be generated for measures in the fact table which
do not have an associated log entry.
Finally, measures in the fact table are loaded from the GPS generated CSV files, and each attribute must
be converted to the data type of the corresponding measure. Additionally, a date-time stamp, and a foreign
key for each of the dimensions, needs to be added. The foreign key for the Date dimension can be extracted
from the header of each of the GPS files. This data needs to be parsed and transformed to the smart key
format described above. The generated smart key may then be used for each entry loaded from the given
GPS file. Similarly, the fact table’s date-time stamp may be created from a combination of the date in the
GPS file’s header, and the timestamp attribute which is present in each line of a file. This is achieved by
parsing both attributes and transforming them into a single date-time data type. The foreign key, player-id,
requires a lookup table. The first and last name attributes in the GPS file may then be used to access the
player-id for each name in a file. The fact that some players are registered with more than one profile
needs to be taken into consideration when creating the lookup table. In other words, some ids need to be
mapped to by multiple names.
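One way to realise such a lookup table is sketched below; the table name, its layout, and the example
names are purely illustrative.

CREATE TABLE player_lookup (
    first_name TEXT,
    last_name  TEXT,
    player_id  INTEGER,
    PRIMARY KEY (first_name, last_name)
);

-- Two GPS profiles (fictional names) resolving to the same player-id:
INSERT INTO player_lookup VALUES
    ('Ola',    'Nordmann', 17),
    ('Ola N.', 'Nordmann', 17);

-- During loading, the id is looked up from the names found in a GPS file:
SELECT player_id FROM player_lookup
WHERE first_name = 'Ola' AND last_name = 'Nordmann';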
In order to guarantee that an entry in the fact table correctly links to entries in the Injury and Illness tables,
a temporary table is created. It stores entries for both injuries and illnesses as well as their respective
primary keys. If the date of a workload entry for a given player matches the date of an injury or illness
entry of the same player, the id is extracted from the temporary table and used as a foreign key in the fact
table. Numerous cases of injury and illness result in a player’s absence from training/competition, and for
this reason, there are no training entries which link to these particular cases. To ensure that all entries of
injury and illness are referenced by an entry in the fact table, non-training facts are created for each entry
of injury and illness which has no corresponding training data.
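One way to express the date-based matching in SQL is sketched below, assuming a temporary table
tmp_injury(injury_id, player_id, datekey) derived from the Injury dimension; the table and column names
are assumptions.

UPDATE training_data t
SET    injury_id = i.injury_id
FROM   tmp_injury i
WHERE  t.player_id = i.player_id
  AND  t.datekey   = i.datekey;
-- Rows that match no injury keep the non-injury placeholder (injury_id = 0).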
A similar process is required for the Training Log dimension, and the loading of the foreign key for this
dimension also makes use of a temporary table for accessing the id of a given training entry. Non-training
entries are also created in the fact table to ensure linkage to all training log entries. Finally, the foreign key
for the Session table is created in the same way as its primary key and is discussed in detail above.


4.4 Physical Design

The physical design of a data warehouse aims to enhance query performance. Three techniques are
commonly used to maximise performance, namely the use of materialised views, indexing and partitioning.
As an in-depth look at these techniques falls outside of the scope of this thesis, only a brief discussion of
materialised views and indexing is presented. Partitioning or fragmentation is particularly effective for
very large databases and divides the contents of a relation into several files. Given the size of the current
data set, partitioning is considered irrelevant for this thesis and is reserved for future work.

4.4.1 Materialised Views

In ROLAP, a materialised view is a view whose results are persisted in table-like form and can be
queried like a regular relational table. Such views serve to enhance query performance by pre-calculating
costly operations such as aggregations and relational joins. As the enhanced performance is achieved at
the expense of additional storage costs, views are typically created for queries which are re-used. This
highlights one of the drawbacks of materialised views: designers must be able to anticipate frequently
used queries.
One such query is identified in this thesis and involves aggregating workload data at the session level and
joining the results with information from the Date and Injury dimensions. The SQL query for creating the
materialised view is presented in Figure 4.13. Here it is seen that a view is created with all workload data
aggregated at the session-level. In addition to workload data, player and session-ids are also extracted from
the fact table. Workload data is joined with the Date dimension to extract the date of each session, and
with the Injury dimension to extract injury data associated with each session. Furthermore, the duration of
each session and the current session-count for each player is calculated.
Without the creation of a materialised view, the planning and execution times for generating the summary
table are 12.5 and 30519.9 milliseconds respectively. Extracting the same information from a materialised
view, however, yields planning and execution times of 0.3 and 0.8 milliseconds. The use of a materialised
view thus results in an execution time that is 38148 times faster than querying the data warehouse directly.
As the information extracted from the query presented in Figure 4.13 contains all the necessary information
required for the data mining presented in the next chapter of this thesis, a materialised view is considered
an optimal choice for this thesis.


CREATE MATERIALIZED VIEW summary_table AS
SELECT
    d.training_date, t.player_id, t.session_id, i.injury_id,
    i.type, i.injury_date, i.recovery_date, i.days_missed,
    SUM(t.total_distance) AS total_distance,
    SUM(t.player_load) AS player_load,
    SUM(t.acceleration_load) AS acceleration_load,
    SUM(t.v4_distance) AS v4_distance,
    SUM(t.v5_distance) AS v5_distance,
    SUM(t.hsr_distance) AS hsr_distance,
    AGE(MAX(t.dt), MIN(t.dt)) AS duration,
    COUNT(t.player_id) OVER (PARTITION BY t.player_id) AS sessions_played
FROM training_data t
JOIN date_dimension d ON t.datekey = d.datekey
JOIN injury i ON t.injury_id = i.injury_id
GROUP BY
    d.training_date, t.player_id, t.session_id,
    i.injury_id, i.type, i.injury_date, i.recovery_date;
Figure 4.13: SQL code for generating a materialised view
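For illustration, the view can then be queried like any relational table; it is not maintained automatically, so
it must be refreshed explicitly when new data is loaded. The player-id value below is purely illustrative.

SELECT training_date, total_distance, hsr_distance, injury_id
FROM summary_table
WHERE player_id = 17
ORDER BY training_date;

REFRESH MATERIALIZED VIEW summary_table;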

4.4.2 Indexing

A key difference between OLTP and OLAP systems has to do with how they are designed to handle queries.
OLTP systems typically experience frequent transactions which only access a small number of tuples, and
often use indexing techniques such as B-Trees which are well suited these types of transactions. Typical
OLAP transactions are, however, more complex in nature and often access a very large number of tuples.
To support these types of transactions, alternative indexing techniques such as join indexes and bitmap
indexes are commonly used.
As the name suggests, join indexes materialise a relational join between two tables by precomputing
join pairs. These typically involve foreign key relations between the fact table and the dimension tables
and are naturally suited to star schema designs. As PostgreSQL does not offer support for join indexes,
B-tree indexes are used instead. B-tree indexes are created for all primary keys. Additionally, indexes are
considered for foreign keys with the intent of further improving join performance. However, the creation
of indexes for foreign keys in the fact table does not yield any significant improvement in performance,
and for this reason, such indexes are not implemented in this thesis.
Indexing columns in the fact table which are likely to be used in combination with selection operators
do however have a significant impact upon performance. Two columns, namely session-id and player-id,
are identified as prime candidates for such operations. These columns are expected to be used in queries
which aim to filter information for specific players or specific training sessions. An example of a typical
query involving summary data for a particular session is provided in Figure 4.14. The query represented
in the figure is used to select the total distance covered by players during a specified session. Similar
queries are often used to check whether a player’s recorded distance is similar to that of the rest of the team
for a particular session. Such queries are particularly helpful in identifying GPS recording errors, which are
relatively prevalent in the data set used in this thesis. The creation of a B-tree index on session-id in the fact
table results in the execution time of the query seen in Figure 4.14 dropping from 2871 to 28 milliseconds,
equating to roughly a hundred-fold improvement.
A similar query may be used to select workload data for a specific player and aids users in gaining an
understanding of workload variation throughout the season. As the identified columns are expected to be
used frequently in queries involving the general understanding of workload data, indexes are created for
both player-id and session-id in the fact table.

SELECT
    player_id, SUM(total_distance) AS total_distance
FROM training_data
WHERE session_id = 25
GROUP BY
    player_id, session_id;

Figure 4.14: A query for selecting data from a specific session
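The two fact-table indexes described above are created with ordinary PostgreSQL statements (B-tree is the
default index type); the index names below are illustrative.

CREATE INDEX idx_training_data_session_id ON training_data (session_id);
CREATE INDEX idx_training_data_player_id  ON training_data (player_id);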

As bitmap indexes are particularly useful for improving query performance for columns with a low number
of distinct values, they may prove useful for the Injury dimension, which only records two categories and
two injury types. Although PostgreSQL does not provide support for bitmap indexes, it does offer an
alternative indexing technique known as BRIN indexes. This thesis, however, does not include the
indexing techniques mentioned above for several reasons. Firstly, the injury category may be selected using
the injury-id column instead of the category column. This is possible because cases of injury are created
with an id-value greater than zero, whereas cases of non-injury are created with an id-value of zero. As
there already exists a B-tree index on the injury-id column, no further index is considered necessary for
the category column. An index is not created for the injury-type column as no queries are identified which
directly make use of this column. Lastly, due to the low number of injury tuples, additional indexes may
not provide significant increases in performance.
Finally, physical modelling is well studied within the field of data warehousing, and there are many
techniques which may further improve the performance of the data warehouse developed in this thesis.
Due to the relatively small size of the data set, as well as the limited scope of this thesis, measures beyond
those which are discussed above are considered relevant for future work.


4.5 The ETL Process

Extraction, transformation and load (ETL), are processes which take place in the back-end tier of a data
warehouse and are responsible for extracting data from an organisation’s sources, transforming them, and
then loading them into a data warehouse. In a high-level description of the ETL process, three distinct
phases are identified. The first of these involves the extraction of data from source data stores, which
are often heterogeneous, and may include operational databases, files, web pages, documents and even
stream data. In the second phase, data is transformed from the format of the sources to the format specified
by the data warehouse schema. This usually requires the propagation of data to a special-purpose area
of the warehouse called the Data Staging Area (DSA). The phase includes tasks such as data cleaning,
which removes errors and inconsistencies in the data; integration, which reconciles data from different data
sources; and aggregation, which summarises data to the level of detail of the data warehouse. The final
phase involves the loading of transformed data into the data warehouse. The ETL process typically refreshes
the data warehouse periodically, a process which involves propagating updates from the source data to
the data warehouse. The frequency at which a data warehouse is refreshed is specified by organisational
policies [47; 46].
The ETL process for this thesis is presented at the conceptual level using the Business Process Model and
Notation (BPMN), as proposed by A. Vaisman and E. Zimanyi [46]. As the notation does not specify the
implementation specifications of ETL processes, it allows users to focus on the characteristics of these
processes without having to understand technical details.

4.5.1 Overview of the ETL Process

The source data for this thesis comes in the form of CSV files, and a detailed description of their content is
provided in the section on source data. Furthermore, additional data sources are required for loading the
Player dimension, and relational tables must be created to ensure correct linking between entries in the
fact table and entries in the dimension tables. The sections which follow provide a step-by-step account of
each of the processes involved, as well as a description of the data required to ensure the correct loading of
the data warehouse implemented in this thesis.
Figure 4.15 presents an overview of the entire ETL process for this thesis. Here it can be seen that
directly after the process begins, a parallel gateway giving rise to two branches is entered. The gateway
symbolises that the order in which the processes that follow are carried out is of no importance. However,
all processes between the parallel gateway and the merging gateway must be completed before proceeding
from the merging gateway. The left branch leads to the process responsible for loading the Date dimension.
The right branch enters another parallel gateway which in turn leads to three processes responsible for
data cleaning. Upon completion of all cleaning tasks, four dimensional loading processes are carried
out. Finally, after all of the dimensions have been successfully loaded, the GPS training data and special
members are loaded. The plus symbols in the figure illustrate that each of the processes involve several
sub-processes. These are explained in further detail in the sections which follow. Once all processes have
been completed, the end symbol is reached, symbolising the termination of the ETL process.

Figure 4.15: Overview of the ETL process

4.5.2 Cleaning of GPS Data

The cleaning process begins with all files from the three data sources, and through a series of tasks,
detects and corrects both corrupt and inaccurate records. Cleaning is performed using a combination of
interactivity, scripting and batch processing techniques which together ensure that the generated output
files are correctly formatted for the loading processes which follow. In this and the following two sections,
the cleaning tasks presented in Figure 4.15 are described. Together they address all of the issues presented

in the section on source data.


The first of these processes involves the cleaning of GPS files and is illustrated in Figure 4.16. Using a
combination of Java programs and interactivity, the source files are used to generate two sets of output
files. In one set, each file contains the training data recorded for a single player during a single session,
whereas the other set contains a file with the times and dates of each team session. For the purpose of
clarification, a team session includes the time interval during which at least one player was recording,
and for this reason, begins when the first player starts recording and ends when the last player stops
recording. Another point of importance is that a single source file contains data for all the players that
were present during a session, whereas the output files are split into separate files which only contain data
for a single player. The motivation for this is discussed in further detail below. Another potential output
for the cleaning process occurs when a source file does not meet specified requirements, resulting in the
file’s removal and the request of a new file with the correct specifications.
It can be seen in Figure 4.16 that the cleaning of GPS files begins with an iterative sub-process which is
responsible for performing parameter and interval checks on each of the input files. Interval checking
ensures that a file’s data meets the granularity requirements of one-second intervals. The task of parameter
checking ensures that the specified data parameters are present as well as that their ordering is consistent.
The termination condition is symbolised by the orange circle in the figure and is met once all input files
have been checked. The completion of the initial checks may result in one of three possible outcomes, a
situation which is illustrated with the use of an exclusive gateway, highlighted in red. The first of these is that a
file does not meet the specified requirements (missing parameters or incorrect granularity), in which case a
new file with the correct specifications is requested. A second possibility is that the ordering of parameters
is inconsistent. In such cases, a re-ordering task is carried out in which a new file with the correctly ordered
parameters is produced. The final possibility involves files which meet all requirements. These are written
to the same location as the re-ordered files and used as input files for the next task in the cleaning process.
As mentioned in Section 3.3, multiple cases of duplicate data exist as a result of a possible software issue
encountered when generating CSV files. For this reason, the next task in the cleaning process is responsible
for detecting duplicate entries. The iterative task compares all sessions which take place on the same day
and checks them for overlaps. It can be seen in Figure 4.16 that the detection of duplicates requires user
interaction before file discrepancies can be resolved. Duplicate files require careful inspection before
taking action to ensure that all data is preserved, while at the same time avoiding the introduction of
duplicate records. Two examples of duplication are encountered in this project. The first occurs when both
duplicates contain the same data. The second occurs when one file contains more data than the other. In
both cases, one of the two files is discarded.
Files without any duplication issues proceed to the next step in the cleaning process, which involves a
sub-process responsible for the removal of "non-training data". The term non-training data in the context
of this project refers to lines of data before or after a session, in which all training parameters have the
value of zero and are of no interest with regard to analysis. It is important to clarify that these lines are of
no interest only if occurring before a session has begun or after it has ended. Lines of data with zero-values
found within the bounds of a session are however very relevant and are not considered as "non-training"
data. As a result of players forgetting to turn off GPS tracking devices, thousands of lines of irrelevant
data exist within the source files. To reduce storage demands as well as remove data that has no relevance
to the research of the project, lines of non-training data are removed in a process consisting of two tasks.
The first of these splits a file into separate files for each player. The resulting files thus contain data for a
single player from a single session. In addition to splitting files, file-headers are also reduced. The split

files are then trimmed, a task which removes all non-training entries before a session’s commencement and
all non-training entries after its termination. The start and end of training are defined as the first and last
lines in a file which contain non-zero values.
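
As an illustration of the trimming task, the following minimal sketch removes leading and trailing
non-training rows from a per-player file. It assumes the file is a CSV with numeric workload columns;
the column names and function name are assumptions made for the example.

import pandas as pd

def trim_non_training(path_in, path_out,
                      workload_cols=("total_distance", "player_load",
                                     "acceleration_load", "v4_distance",
                                     "v5_distance", "hsr_distance")):
    df = pd.read_csv(path_in)
    # A row counts as training activity if any workload column is non-zero.
    active = df[list(workload_cols)].ne(0).any(axis=1)
    if not active.any():
        return False  # no training data at all: the file is discarded
    first, last = active.idxmax(), active[::-1].idxmax()
    # Keep everything between the first and last active rows (inclusive),
    # so that zero-valued rows within the session are preserved.
    df.loc[first:last].to_csv(path_out, index=False)
    return True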
Two types of files are generated from these tasks. The first of these is a file which does not contain any
training data and is the result of the Catapult software generating data for players that did not participate
during a session. As can be seen from the figure, these files are discarded. The other type of file generated
contains one or more lines of data for a single player recorded during a single session. The first line of
data in these files represents a player’s first line of recorded training activity, which may occur after the
registered start time of a team session. This is a result of the fine granularity of the data which identifies
the exact second a training activity began for a given individual. It is this second file type which is used for
loading of the data warehouse and represents the first file output of the cleaning process.
The second output for the cleaning process is a file containing the dates and times of all team sessions and
is generated from the first set of output files. The task involves comparing individual sessions for a given
date and determining whether they can be grouped as a single team session, or whether they belong to
separate team sessions taking place on the same day. Individual sessions which are considered a member
of the same team session are compared so that the start and end times for the team session may be adjusted
to incorporate all individual sessions. In doing so, the start time for the team session is represented by the
earliest individual start time in the group, and the end time for the team session is represented by the latest
individual end time in the group. The output file for the above task thus contains a list of dates with the
start and end times of each team session and may contain multiple entries with the same date. In the case
of multiple entries with the same date, each entry represents a non-overlapping team session taking place
on the given date.
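
A minimal sketch of how individual sessions might be grouped into team sessions is given below. It
assumes each individual session is available as a (start, end) pair of timestamps for a given calendar date;
the function name is illustrative.

def merge_team_sessions(intervals):
    # intervals: list of (start, end) datetime pairs recorded on the same date.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the current team session: extend its end time if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: this individual session starts a new team session.
            merged.append((start, end))
    return merged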
The cleaning of GPS data thus resolves three of the five training data issues identified in Section 3.3. The
issues addressed here are, the duplication of data, missing and wrongly ordered parameters, and entries of
non-training data. The remaining two issues, namely, players registered with multiple profiles, and players
without training data, are discussed below.

4.5.3 Cleaning of Injury and Illness Data

Two issues are identified in the injury and illness data, namely incorrect dates of injury and recovery, and a
missing injury type. The first of these involves recovery dates preceding the dates of injury for which three
cases are found in the source file. These issues are attributed to human error and are resolved by switching
the dates. The second of the identified issues is resolved by consulting the team’s medical personnel and
manually assigning a type to the injury entry.

Figure 4.16: Tasks involved in the cleaning of GPS training data

4.5.4 Cleaning of Log Data

The three issues identified among the log data are resolved using scripting and interactivity techniques.
The first of these involves data entries which are spread over multiple lines and is caused by newline
characters added by players or team personnel when writing comments to data entries. To facilitate
the loading process, entries are represented on a single line, and for this reason, newline characters are
removed. The second issue encountered in the log data is that of players being registered with multiple
profiles. The above issue is addressed by creating a list of all registered profiles. This is used by team
personnel to identify all profiles which are related to the same player. A single profile is then decided
upon, and its name and id are used to replace the names and ids of all other profiles registered for the same
player. The final issue involving the log data is that which concerns players who have log entries but no
training data. This problem is resolved by creating a single non-training tuple in the fact table and is dealt
with in the section on special members.

4.5.5 Load Date Dimension

Loading of the Date dimension is presented in Figure 4.17, and includes three tasks. The first of these
involves the creation of a match-day lookup. This is achieved using a Python script, which parses a text
file containing the dates of all matches, and creates a list of competition dates. The remaining two tasks are
grouped into a sub-process which repeats until a termination condition is reached. The first of these two
tasks also makes use of a Python script, which parses a text file containing the years for which date-entries
are to be created. For each day of a given year, a date tuple is created and then inserted into the Date table.
The value of the "Matchday Indicator" attribute in each date-tuple is set using the list of competition days
generated in the first task. This process continues until the entire text file has been parsed, and a tuple for
every date has been inserted into the database.
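
A minimal sketch of this loading process is shown below. It assumes a matches.txt file with one
ISO-formatted match date per line, a years.txt file with one year per line, and a date_dim table with the
columns shown; all of these names are assumptions made for the example.

from datetime import date, timedelta
import psycopg2

# Match days are used to set the matchday indicator for each date tuple.
match_days = {line.strip() for line in open("matches.txt") if line.strip()}

conn = psycopg2.connect(dbname="warehouse")
with conn, conn.cursor() as cur:
    for year in (int(line) for line in open("years.txt") if line.strip()):
        day = date(year, 1, 1)
        while day.year == year:
            cur.execute(
                "INSERT INTO date_dim (date, day, month, year, matchday_indicator) "
                "VALUES (%s, %s, %s, %s, %s)",
                (day, day.day, day.month, day.year, day.isoformat() in match_days),
            )
            day += timedelta(days=1)
conn.close()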

Figure 4.17: Loading of the Date dimension

4.5.6 Load Session Dimension

The Session dimension is loaded from a file produced during the cleaning of GPS data. This file includes
the dates and times of each session. Before session data is loaded, a non-session tuple is first generated
and inserted into the Session table as a single entry which can be used for fact entries that do not reference
a specific training session. An example of this is when a player is absent from training as a result of either
an injury or an illness. In this case, a fact entry is required to bind an injury/illness tuple to a player-id
in the Player dimension as well as a date entry in the Date dimension. Without such a fact entry, important
details would be unavailable to users. A fact entry is thus created, and as a result of a table’s foreign key
constraints, non-entry values must be available for the appropriate dimensions. This is the case for all
dimensions in the data warehouse with the exception of the Player and Date dimensions. This step is only
applicable for the first loading of the warehouse, after which the same entry is kept for subsequent loading
updates.

Figure 4.18: Loading of the Session dimension

Once the loading of the non-session entry has been completed, the loading process enters an iterative
sub-process which parses the session file mentioned above, creates a session tuple, and loads it into the
warehouse. One additional task in this sub-process involves the loading of session data into a temporary
session table which includes date-time stamps for the start and end of each session. These attributes are
required during the loading of the fact table to link fact entries to the correct session entry. This process
is explained in detail in the section on the loading of the fact table. The process of loading the Session
dimension terminates once all sessions in the file have successfully been loaded into the data warehouse.

4.5.7 Load Player Dimension

Loading of the player dimension is a user intensive task and requires multiple input files to ensure that all
players are accounted for in the data warehouse. An overview of the process is illustrated in Figure 4.19.
In the first step of the process, a list of all player names and ids is generated from the three source files.
The initial list produced in this step contains issues such as duplicate profiles as well as profiles without
player ids. These issues are resolved in the next task, which results in two new lists being produced. The
first of these is a set consisting of all unique profiles present among the three data sources and can be
thought of as a union of all player profiles.

$\text{Player profiles} = \text{Profiles}_{\text{Log data}} \cup \text{Profiles}_{\text{Injury data}} \cup \text{Profiles}_{\text{GPS data}} \qquad (4.1)$

Profiles in the above list are represented by unique profile-ids which are used to identify players in the
data warehouse. The second list produced is a mapping from player names in the GPS file to profile-ids
in the warehouse and is used during the loading of the fact table to map training entries to the correct
player-id. Existing ids are used for players who are registered in the AthleteMonitoring software, whereas
new ids must be created for players who do not have registered profiles. The third task in the process
is responsible for adding biometric features to the list of player profiles. It serves as the input for the
sub-process responsible for loading player data into the Player dimension. This sub-process iterates over
every profile in the list, creating player tuples for each player before loading them into the warehouse. The
loading of the Player dimension thus terminates once all profiles have been loaded.

Figure 4.19: Loading of the Player dimension

4.5.8 Load Injury and Illness Dimensions

The Illness and Injury dimensions are loaded from a CSV file generated by the AthleteMonitoring software.
The entire process is illustrated in Figure 4.20. From the figure, it can be seen that the loading process
begins with a single task followed by an iterative sub-process involving four tasks. On completion of the
loading process, all entries have been loaded into their respective tables. Additionally, tuples representing
the events of a non-injury and a non-illness are also loaded.
Starting from the green circle at the top of the figure, it can be seen that the first task involves the loading
of a non-injury and a non-illness tuple into their respective tables. The purpose of such tuples is to ensure
that foreign key constraints in the fact table are satisfied. These constraints require that every entry in the
fact table references a single entry in both the injury and illness dimensions. For this reason, non-injury
and non-illness tuples are created for fact entries which are not associated with an injury or an illness.
Completion of this task leads to the commencement of the sub-process responsible for loading file entries
into the data warehouse. This is an iterative process which begins with the reading of a single entry from the
CSV file. An injury/illness tuple is created and inserted into a temporary table with an auto-incrementing
primary key. The primary key generated in the temporary table is used as the primary key when the tuple is
loaded into the injury or illness table. The temporary table thus enables mapping of training entries to the

correct injury and illness tuples in the data warehouse. This is done by looking up the date and player-id
attributes of an entry in the temporary table and using its primary key to locate the correct entry in the
Injury or Illness dimension.
Once the temporary entry has been added, the next task involves extracting the most recently-added
entry’s primary key, as mentioned above. The final task in the iterative sub-process is dependent upon
whether the current entry represents an injury or an illness. This is illustrated in the figure by an exclusive
gateway which symbolises that only one of the two remaining tasks can be executed for each iteration
of the process. Depending on whether the current entry being processed is an illness or an injury,
the appropriate attributes are selected and inserted as a tuple into the data warehouse. The sub-process
continues until every entry in the CSV file has been parsed, and a tuple for each entry has been inserted
into the corresponding table of the data warehouse. This is expressed by the conditional event highlighted
in orange in the figure and symbolises the loop’s termination condition. Once this has been fulfilled, the
entire process then terminates as is illustrated by the red circle.

Figure 4.20: Loading of the Injury and Illness dimensions

4.5.9 Load Log Dimension

The loading process for the Log dimension is very similar to that of the loading process for the Injury
and Illness dimensions, and tuples in the log table are generated from a CSV file produced by the
AthleteMonitoring software. This process is very similar to the one presented in Figure 4.20 and begins
with the generation of a log tuple from an entry in the CSV file. The tuple is then inserted into a temporary
log table which has an auto-incrementing primary key. As with the process above, the temporary table
is created to ensure correct mapping between fact and log entries. Once a tuple has been added to the
temporary log table, its primary key is extracted and used as the primary key for the current entry. The
iterative process terminates once all entries in the CSV file have been loaded into the data warehouse.

4.5.10 Load GPS Data

Loading of the fact table may only begin after the loading processes for the dimension tables have
completed. This is a result of foreign key constraints in the fact table which require the existence of a
primary key in a referenced dimension table before tuples may be added to the fact table. The loading
processes described above thus guarantee that all foreign key values exist as primary key values in the
relevant dimension tables before the loading of GPS data begins.
An overview of the loading process is presented in Figure 4.21 and begins with a single task before an
iterative sub-process consisting of eight tasks is carried out. The process continues until all of the specified
GPS files have been parsed and loaded into the data warehouse.
The first task in the process is responsible for establishing a player-id lookup which is created from the
mapping file produced during the loading of the Player dimension. The lookup maps player names in GPS
files to player ids in the data warehouse and is used for each file in the GPS loading process.
Once the id-lookup has been created, an iterative sub-process responsible for creating and loading fact
tuples is carried out. The input for each iteration is a GPS file which is produced during the cleaning of
GPS data. During parsing, the player name is used to obtain the player-id from the id-lookup created above.
The next two tasks in the iterative process involve extracting the illness and injury ids from the temporary
injury-illness table and require the date and the player-id from the file being parsed. The session-id is
then extracted from the temporary session table. The relevant session-id is located using a GPS entry’s
date-time stamp as a means of identifying which session the entry belongs to. This is done by comparing
an entry’s date-time stamp with the start and end date-time stamps of individual session entries. The final
foreign key lookup in the sub-process is that for the log-id and is extracted from the temporary log table
using the current GPS entry’s player-id and date to locate the relevant entry. In the case of injury, illness
and log ids, if no entry is located, a value of zero is used, which references the special entries (non-injury,
non-illness and non-log) in each of the respective dimensions. This is not applicable for the extraction
of session-ids as each GPS entry is expected to belong to exactly one session entry. Once all foreign key
values have been extracted, a tuple for each entry in the GPS file is created and inserted into the database.
An important point of clarification is that the same foreign key values are used for every entry in a given
GPS file. The reason for this is that each GPS file represents a single session for a single player on a given
date. Hence all entries in a given file reference the same entries in each of the dimension tables. It is for
this reason that a combination of foreign keys cannot be used as a primary key for the fact table despite it
being common practice in data warehousing. The process of loading GPS data terminates once all the
specified files have been loaded into the fact table.
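
As an illustration, the session-id lookup described above might be implemented as a simple interval
check. The following sketch assumes the temporary session table has been read into memory as
(session_id, start, end) tuples; the function and variable names are assumptions.

def find_session_id(timestamp, temp_sessions):
    # temp_sessions: iterable of (session_id, start, end) date-time tuples.
    for session_id, start, end in temp_sessions:
        if start <= timestamp <= end:
            return session_id
    # Every GPS entry is expected to belong to exactly one team session.
    raise ValueError("GPS entry does not fall within any recorded team session")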

Figure 4.21: Loading of the fact table

4.5.11 Insert Special Members

The final task in the ETL process involves inserting special members into the fact table to ensure that
every tuple in the Injury, Illness and Log dimensions is referenced by at least one tuple in the fact table.
Foreign key constraints in the fact table ensure that every entry in the fact table references an entry in the
corresponding dimension table. There is however no constraint in place which ensures that an entry in a
dimension table is referenced by an entry in the fact table. Due to the nature of the data being modelled,
this may result in issues for users when trying to access information from the data warehouse. An example
of such an issue occurs when a player is absent from training as a result of an injury or an illness. This
would result in an entry in either the Injury or Illness dimension with no entry in the fact table to reference
it. Without making changes to the data warehouse, a user would be able to access the injury or illness entry
without being able to access the corresponding player-id. Similarly, if there were to exist no entry in the
fact table for a given log entry, users would be unable to obtain the corresponding player-id or date for the
given entry.
To avoid such issues, special members are inserted into the fact table to ensure that every entry in the
above-mentioned dimensions is referenced by at least one tuple in the fact table. The process for achieving
this is similar for all three dimensions and involves iterating over every entry in a given dimension and
checking whether it is referenced by an entry in the fact table. If no reference exists, a fact tuple is
created by retrieving the relevant data through a series of lookups in the temporary databases to extract
corresponding dates and player ids. Additional lookups are also carried out to extract possible ids for
corresponding entries in the other dimension tables.
An example of this is when no fact entry is found for a given injury entry. The player id corresponding to
the injury entry is then retrieved from the temporary injury-illness table. Further lookups are carried out to
retrieve entry ids for entries that may exist in the illness and log tables for the same player on the same
date.
Once the necessary data has been retrieved, a tuple is created and inserted into the fact table. This tuple
contains the relevant foreign keys for the given player on a given date. A point of importance is that the
foreign key for the Session dimension is always set to the value that references the non-session entry in this
dimension. Furthermore, all measures are set to a value of zero.

Chapter 5

Data Mining

This chapter presents an account of the data mining approach taken in this thesis. Data mining is used in an
effort to achieve the second goal of the thesis, which aims to predict the onset of player injury. Following
the CRISP-DM process model, a five-phase approach is taken. Starting with the business understanding
phase, data mining goals are clearly specified, before beginning with the data understanding phase, in
which exploratory data analysis is performed on data extracted from the data warehouse. A final data
set is then prepared in the data preparation phase, and is used for building and testing several prediction
models in the modelling phase. Finally, models are evaluated using several performance measures, and an
explanation of their performance is presented.

5.1 Business Understanding

Business understanding in the context of this study is synonymous with the analysis goal of this thesis,
which is defined as the assessment of a model’s ability to predict future injuries from workload and injury
data. The motivation for this work is inspired by two studies which are discussed in Section 2.4. Elements
from each of these studies are included in this work to test the reproducibility of previous modelling
approaches on a new data set.

More specifically, the aim of this thesis is defined in terms of the following objectives:

• To assess several machine learning models in their ability to predict whether a player will get injured
in the next session.

• To assess whether different definitions of injury affect the ability of a model to detect injury [5].

• To assess how a model’s predictive performance is affected when the data set is reduced to a subset
of previously identified features [39].

• To assess whether the use of feature extraction improves a model’s predictive performance [39].

5.1.1 Supervised Learning

Statistical learning problems typically fall into one of two categories, namely supervised or unsupervised
learning. The modelling problem in this thesis naturally falls into the category of supervised learning,
as each predictor measurement xi is associated with a response measurement yi [22]. More specifically,
each set of GPS features is associated with an injury response measurement, representing whether or not
the given player got injured. The objective of learning is to fit a model that relates the response to the
predictors with the aim of accurately predicting future responses [22]. Supervised learning problems are
further characterised as being either quantitative or qualitative. In the case of this thesis, the response
variable is considered qualitative, meaning that each response falls into one of K classes. Furthermore, as
the response variable falls into one of two classes, injured or not-injured, the learning problem for this
thesis is characterised as a binomial classification problem.
There exist a wide variety of classification models, each of which presents a trade-off between accuracy
and interpretability [22]. More restrictive modelling techniques are often easy to interpret, a quality which
is of great practical value for both the coaching staff and players. The ease of understanding offered by
such techniques often comes at the expense of their lower prediction accuracy when compared to more
flexible "black box" approaches. As predictive accuracy is of vital importance, and interpretability is
highly desirable, a variety of modelling techniques are employed in this thesis. Previous work has been
able to achieve good results using decision trees, a technique which provides a good balance between
interpretability and accuracy [39].
A fundamental aspect of supervised learning involves splitting the data set into a training and a test set. One
of the major challenges encountered in this study is that of an unbalanced data set, meaning that there are
a disproportionately large number of non-injury cases compared to the number of injury cases. Skewed
data sets compromise a classifier’s ability to learn because of a model’s tendency to focus on the prevalent
event and ignore the rarer one [33]. Methods for correcting the issue of class imbalance can be grouped
into two general strategies, correction techniques at the learning level, and correction techniques at the
data level. Techniques at the learning level aim to strengthen a learning model’s ability to identify the
minority class. Data level techniques focus on altering class distribution by randomly oversampling the
minority class, undersampling the majority class, or by creating artificial samples of the minority class
[33]. This thesis adopts the strategy of generating synthetic data in an attempt to correct the imbalance
caused by the prevalence of non-injury cases. Specifically, an adaptive synthetic sampling (ADASYN) approach is taken, in an effort
to avoid the possible loss of useful data which may result from undersampling, or the potential risk of
overfitting which may occur from oversampling.

5.1.2 Injury Definition

This study focuses on a sub-set of the season’s injury data known as non-contact injuries. These are defined
as injuries occurring without extrinsic contact with another player or an object on the field. Furthermore,
this thesis considers two different types of non-contact injury [5]. The first of these, non-contact (NC),
includes all injuries which fall into the category of the definition above. The second, non-contact resulting
in time loss (NCTL), is defined as "an injury that occurred during a scheduled training session or match
that caused absence from the next training session or match" [14]. Separate models are created for both
definitions to determine whether a model performs better on specific injury types.

5.1.3 Data Granularity

One of the advantages provided by a data warehouse is the ease with which data can be extracted at different
levels of granularity. The granularity chosen for this study is that of the session-level, meaning that
workload data is aggregated for each session completed by an individual player. This is a natural choice as
injury data is recorded with the date of its occurrence being the finest level of time granularity available.
Positive injury labels are thus created for the last session completed by a given player before the date of
their next recorded injury.
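
A minimal sketch of this labelling step is given below, assuming a session-level data frame with one row
per player per session and an injury table containing injury dates; the column names are assumptions made
for the example.

import pandas as pd

def label_sessions(sessions, injuries):
    # sessions: player_id, date and workload columns; injuries: player_id, injury_date.
    sessions = sessions.sort_values(["player_id", "date"]).copy()
    sessions["label"] = 0
    for _, inj in injuries.iterrows():
        prior = sessions[(sessions["player_id"] == inj["player_id"]) &
                         (sessions["date"] < inj["injury_date"])]
        if not prior.empty:
            # The last session completed before the injury date is labelled positive.
            sessions.loc[prior.index[-1], "label"] = 1
    return sessions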

5.2 Data Understanding

In this section, workload and injury data are extracted from the data warehouse developed in the previous
chapter, and exploratory data analysis (EDA) is performed to gain insights into the data. A summary of
workload data is initially presented before applying smoothing to correct GPS recording errors. Several
summary statistics and a visual representation of the season’s injuries are then presented. Finally, comparisons
between the workloads of injured and non-injured players are made in an attempt to gain insight into
possible relationships that might exist between workload and injury data.

5.2.1 Summary Statistics for the Workload Features

A summary of workload data aggregated at the session-level is presented in Table 5.1. Data is summarised
from 3005 sessions completed by 38 players during the 2019 season. The data represented in the table
includes summary statistics for the six GPS features as well as the number of sessions completed by
players during the season.
It is evident from the summary data that several data quality issues exist within the data set. These issues are
highlighted by the highly skewed values seen for features such as distance, acceleration load, V5
distance, and HSR distance. The skewed distributions are primarily accounted for by the unrealistically
large maximum values of these features. Closer inspection of the data suggests that these values can be
explained by faulty readings produced by a single GPS unit on three separate occasions.
Another issue highlighted in the summary table is that of GPS units under-recording, and is illustrated
by the presence of minimum values of zero. On closer inspection, 174 sessions are identified as having

Variables Mean Stdev Median Min Max Range Skew


Distance 8965.82 131905.50 4493.01 0.00 4657235.00 4657235.00 32.46
Player Load 495.51 276.58 460.18 0.00 1761.80 1761.80 0.83
Acceleration Load 1482.30 801.07 1415.06 0.00 20280.90 20280.90 4.61
V4 Distance 174.47 183.78 121.84 0.00 1328.64 1328.64 1.66
V5 Distance 71.23 999.07 16.28 0.00 54472.26 54472.26 53.77
HSR Distance 245.70 1023.56 146.43 0.00 54472.26 54472.26 49.55
Session Count 81.22 51.31 75 1 167 166 0.13

Table 5.1: Summary statistics

unrealistically low values across all GPS features. These low values can be explained by GPS units turning
off during sessions as a result of device faults or low battery levels. The data issues mentioned above are
classified as noise and missing values respectively and may prove harmful to prediction models if not
corrected in advance.
Correction of data quality issues is typically carried out during the data preparation phase of the data
mining cycle. The data quality problems are, however, corrected in this phase to facilitate ease of reading.
There are a wide variety of techniques for correcting noisy data and missing values [17]. Two approaches
are considered in this thesis, namely, removal of incorrect values, and smoothing of incorrect values by
bin means. In the next phase of the data mining cycle, engineered features are introduced. Due to the
dependence of these features upon values from previous training sessions, it is considered more harmful to
exclude values than to replace them with smoothed averages. For this reason, the approach of smoothing
by bin means is adopted, and training sessions with recorded values falling outside of a defined range are
replaced by the mean values of all correct recordings from the same session.
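
The smoothing step might be sketched as follows, assuming a session-level data frame with one row per
player per team session; the function name, column names and bounds are assumptions made for the
example.

import pandas as pd

def smooth_by_bin_means(df, col, lower, upper):
    out = df.copy()
    bad = (out[col] < lower) | (out[col] > upper)
    # Mean of the correctly recorded values within the same team session.
    bin_means = out.loc[~bad].groupby("session_id")[col].mean()
    # Individual outliers are replaced by their team-session mean; sessions in
    # which every recording is out of range have no mean and keep their values.
    replacement = out.loc[bad, "session_id"].map(bin_means)
    out.loc[bad, col] = replacement.fillna(out.loc[bad, col])
    return out

# Example: smooth the distance feature with illustrative bounds of 1000 m and 20000 m.
# sessions = smooth_by_bin_means(sessions, "total_distance", 1000, 20000)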
Table 5.2 presents a summary of the GPS data after smoothing by bin averages has been applied. It can
be seen that the smoothing technique results in a drastic improvement in skewness values for all of the
problematic features identified above. The improvement is attributed to correcting unreasonably high
maximum values and unreasonably low minimum values. An advantage of the smoothing technique used
is that outliers from an individual session are only corrected when they belong to a team session which
falls within the range of normal. In other words, the technique targets individual outliers which may
represent an incorrect recording of a single GPS unit. If the average values for an entire team session were
to fall outside of the range of normal, this would suggest a correctly recorded session, and the recorded
values of individual sessions would remain unchanged. This concept is highlighted by the distance feature,
which has a minimum value of 500m, despite a lower bound of 1000m being specified. This value remains
unchanged as it belongs to a team session in which all values were unusually low.
The features V4 distance, V5 distance and HSR distance only record running at speeds above 20 km/h. Players do not
achieve these speeds in every session. Hence a minimum value of zero is still observed for these features
after smoothing has been applied. Finally, the majority of features have distributions which are considered
to be highly positively skewed (≥ 1), meaning that the mass of each distribution is concentrated towards
the left and that the distributions have long right tails.

Variables Mean Stdev Median Min Max Range Skew


Distance 5132.23 2546.09 4640.99 500.06 14209.09 13709.03 1.08
Player Load 526.10 261.44 478.47 31.33 1761.80 1730.47 1.00
Acceleration Load 1555.72 655.87 1461.20 138.37 3982.75 3844.38 0.57
V4 Distance 186.57 183.43 133.63 0.00 1328.64 1328.64 1.57
V5 Distance 56.49 112.38 19.93 0.00 2501.57 2501.57 10.10
HSR Distance 243.07 261.29 163.54 0.00 3191.37 3191.37 2.38
Session Count 81.22 51.31 75 1 167 166 0.13

Table 5.2: Summary statistics after smoothing by bin means

Statistic Non-contact (NC) Non-contact time-loss (NCTL)


Total Count 45 14
Players Injured 22 11
Players with 1 Injury 11 8
Players with 2 Injuries 3 3
Players with 3 Injuries 5 0
Players with more than 3 Injuries 3 0
Days missed 551 551

Table 5.3: Injury statistics

5.2.2 Summary Statistics for the Injury Data

A summary of injury data for both definitions of non-contact injury is presented in Table 5.3. As described
in Section 5.1, two definitions of injury are considered, namely, non-contact (NC) and non-contact causing
time loss (NCTL). The former of the two definitions naturally has a higher number of injuries when
compared with the stricter definition, and it can be seen that there are over three times as many NC injuries
as there are NCTL injuries. In addition to NC injuries occurring more frequently, a higher incidence of
players incurring multiple injuries is also observed for this definition of injury. Half of all the players that
received NC injuries were injured on more than one occasion.
Figure 5.1 provides a visual perspective of player injuries in the form of a Gantt chart. NCTL injuries are
represented in orange, whereas NC injuries are represented by both the blue and orange bars in the chart.
Each bar in the chart represents the date and duration of an injury for a given player. The figure provides a
clear indication of the sequence of injury occurrence through the season.
The span of the orange bars represents both the duration of a player’s injury and the number of days
a given player was absent from training and competition. The span of the blue bars, however, indicates
only the duration of a player’s injury, and not the number of missed sessions. The chart highlights several
important issues that must be considered in the next phase of the data mining cycle. Firstly, multiple
incidents of injury occur very early in the season, and may need to be excluded from the final data set as a
result of there being insufficient workload data prior to their occurrence. Secondly, two injuries resulted in
the affected players being absent from training/competition for approximately half of the season’s duration.
As a result of their absence, there is limited workload data for these players, a factor which could influence
model performance. Another consideration worth noting is that 8 of the 14 NCTL injuries are preceded by
another injury.

5.2.3 Comparison of Workload Features for Injured and Non-Injured Players

A comparison of load features for non-injured players and players that received NC injuries is presented
using boxplots in Figure 5.2. Workloads of players who did not get injured during the season are
represented in orange, whereas the workloads for players who did receive injuries are represented in blue.
It can be seen in both Figures 5.2a and 5.2b that the load features for injured and non-injured players are
almost identical. The similarity between the figures’ median values, interquartile ranges (IQR), as well
as their minimum and maximum values does not provide much insight into possible relationships between
load features and non-contact injuries. One point worth noting is that players with injuries had a greater

Figure 5.1: Visual overview of injury occurrence for the 2019 season

number of outliers for both workload features. The fact that these are present in both features highlights the
correlation between the two features, both of which are calculated from acceleration data provided by the
GPS’s accelerometer.
Similarly to above, a comparison of distance features for non-injured players and players that received NC
injuries is presented using boxplots in Figure 5.3. Median distance values in Figure 5.3a are very similar
for both injured and non-injured players. This is particularly surprising as players carrying injuries are
expected to have lower distance values as a result of reduced participation caused by injury. Despite the
apparent similarity between the boxplots presented in Figure 5.3b, the observed differences between the
HSR distances of non-injured and injured players are the most noteworthy of all workload features in this
study. This is an interesting observation as higher HSR values have been associated with the incidence of
injury [39].
Finally, a comparison of distance features for injured and non-injured players with respect to the NCTL
definition of injury is presented. It is seen in Figure 5.4 that the stricter definition of injury yields almost
identical boxplots for both distance features. This is again surprising for the distance feature, as players
with more severe injuries are expected to have lower average mileages as a result of lower training loads
after returning from injury. In contrast to the differences in HSR distance observed for players with NC
injuries, almost no difference is seen between the HSR distances of non-injured players and players with
NCTL injuries. Due to the similarity of load values for players with NCTL injuries and non-injured players,
no boxplots are included.
An initial comparison between non-injured players and players with NC and NCTL injuries does not provide
much insight into potential relationships that may exist between workload features and injury. Workload
features are similar for all features and both definitions of injury, with the difference in HSR distance
between non-injured players and players with NC injuries being the most notable.

(a) Player Load (b) Acceleration Load

Figure 5.2: Boxplots comparing the workload features of injured and non-injured players with respect to
NC injury

(a) Distance (b) HSR Distance

Figure 5.3: Boxplots comparing the distance features of injured and non-injured players with respect to
NC injury

(a) Distance (b) HSR Distance

Figure 5.4: Boxplots comparing the distance features of injured and non-injured players with respect to
NCTL injury

5.3 Data Preparation

The data preparation phase involves constructing a final data set from the data presented in the previous
phase. The final data set consists of the six original GPS workload features, twelve additional relative
workload features, and a single injury feature calculated from the player injury data. Given the discrete
nature of the original data set, data transformations are required to provide an indication of a player’s
accumulated workload. Several well-studied methods for calculating the relative workload of a player are
discussed in detail below.

5.3.1 The Exponentially Weighted Moving Average (EWMA)

The exponentially weighted moving average was first introduced as a control scheme for detecting small
shifts in the mean of a process [37]. More recently it has also been proposed as a method for calculating the
relative workloads of athletes, to account for the decaying nature of fitness and fatigue over time [49]. This
is achieved using a decay factor that weights the importance of recent workloads, where recent workloads
are weighted more heavily than older ones. The result is thus a smoothed average of a given workload. A
simple representation of the formula is given as:

$\mathrm{EWMA}_{\mathrm{today}} = \mathrm{Load}_{\mathrm{today}} \times \lambda + (1 - \lambda) \times \mathrm{EWMA}_{\mathrm{yesterday}} \qquad (5.1)$

where:
$0 \leq \lambda \leq 1$

λ represents the degree of decay, with higher values discounting older values at a faster rate. It is commonly
represented as:

$\lambda = 2/(N + 1) \qquad (5.2)$

where:
N is referred to as span, and represents the chosen decay constant

EWMA features are created for each of the six GPS workload features using a span of 6 [39; 5]. EWMA
calculations account for non-training days by using a value of zero. Thus players with fewer sessions will
have lower EWMA values for a given period than players with more sessions (assuming that both sets of
players have similar workloads for each session).
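
A minimal sketch of the EWMA feature for a single player is given below, assuming the player's
session-level data is held in a pandas data frame with a datetime "date" column; the names are assumptions
made for the example. With adjust=False, pandas applies the recursive form of Equation 5.1, and a span of
6 corresponds to λ = 2/(6 + 1).

import pandas as pd

def ewma_feature(player_df, col, span=6):
    # Place the workload on a full calendar so non-training days count as zero.
    daily = (player_df.groupby("date")[col].sum()
             .reindex(pd.date_range(player_df["date"].min(),
                                    player_df["date"].max(), freq="D"),
                      fill_value=0.0))
    # Recursive EWMA: EWMA_today = λ·Load_today + (1 − λ)·EWMA_yesterday.
    return daily.ewm(span=span, adjust=False).mean()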

5.3.2 The Mean Standard Deviation Ratio (MSWR)

The mean standard deviation ratio is a commonly used technique to quantify the monotony of a player’s
workloads [39; 4; 1; 5]. It is defined as the ratio between the mean and the standard deviation of a given
workload for a specified period. MSWR values thus represent the variation in a player’s workload, with
higher values indicating more monotonous workloads for the specified period.

$\mathrm{MSWR}_t = \dfrac{\mu_t}{\sigma_t} \qquad (5.3)$

where:
$\mu_t$ is the mean of a workload feature for time period $t$
$\sigma_t$ is the standard deviation of a workload feature for time period $t$

MSWR features are created for each of the GPS features as an average of a feature’s workload for the
previous seven days divided by the standard deviation of the workload for the same period [39; 4; 1; 5].
The MSWR calculation includes all calendar days since the first recorded session for each player and uses
a value of zero for days on which no workload data is recorded. As a result of the feature requiring seven
calendar days of workload data, the first six calendar days since a player’s first session are not included in
the final data set.
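
A corresponding sketch of the MSWR feature is shown below; "daily" is a calendar-indexed workload
series with zeros on non-training days, as produced in the EWMA sketch above, and the function name is
an assumption.

def mswr_feature(daily, window=7):
    # Ratio of the 7-day rolling mean to the 7-day rolling standard deviation.
    # The first six calendar days yield NaN and are dropped from the final data set.
    return daily.rolling(window).mean() / daily.rolling(window).std()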

5.3.3 Injury Feature

The injury feature created in this study takes into account the number of previous injuries a player has
received, as well as the number of days since a player’s return to play after recovering from their most
recent injury [39]. Its calculation involves modifying the EWMA presented above, which can be written as
a moving average of past and current observations [31]:

$Z_i = \lambda \sum_{j=0}^{i-1} (1 - \lambda)^j X_{i-j} + (1 - \lambda)^i Z_0 \qquad (5.4)$

where:
$Z_i$ is the calculated EWMA for $i$ sessions
$X_i$ is the recorded value for session number $i$
$\lambda$ is a value from 0 to 1
$Z_0$ is the starting EWMA value

By setting $X_i$ to the number of injuries a player has had to date, and $Z_0$ to $X_i - 1$, the injury feature can be
calculated as follows:

$I_d = \lambda \sum_{j=0}^{d-1} (1 - \lambda)^j X_i + (1 - \lambda)^d X_{i-1} \qquad (5.5)$

where:
$I_d > 0$ for $X_i \geq 1$ and $I_d = 0$ for $X_i = 0$
$d$ is the number of days since a player’s return to play after their last injury

A feature value of zero thus means that a player has never been injured, whereas a value greater than zero
indicates that a player has had at least one injury during the season. For players with one or more injuries,
the feature increases for each day that passes since the player’s return to play. For a player with a single
injury, the value of the feature starts from a value of 0 and increases for each day passed until it eventually
reaches a value of 1. Similarly, for a player with two injuries, the value of the feature grows from 1 to 2,
increasing every day until the value of 2 is reached. The rate of growth of the feature is determined by λ,
with larger values resulting in faster growth rates.

The injury feature thus serves as an indication of both the number of injuries a player has received as well
as the number of days since returning to play after receiving their last injury.
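
A minimal sketch of the injury feature is given below. It uses the closed form of Equation 5.5 (which, with
$Z_0 = X_i - 1$, simplifies to $X_i - (1 - \lambda)^d$); the function name and the value of λ are assumptions,
since the growth rate used in the final implementation is not fixed here.

def injury_feature(num_injuries, days_since_return, lam=0.1):
    # num_injuries: X_i, injuries to date; days_since_return: d, days since return to play.
    if num_injuries == 0:
        return 0.0
    # Closed form of Equation 5.5: I_d = X_i − (1 − λ)^d,
    # growing from X_i − 1 towards X_i as d increases.
    return num_injuries - (1 - lam) ** days_since_return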

5.3.4 Creation of the Final Data Set

Two data sets are constructed for the modelling phase, one for the modelling of NC injuries and the
other for the modelling of NCTL injuries. Both data sets include six GPS workload features, six EWMA
workload features, six MSWR workload features, one injury feature, and a label indicating whether the
player got injured after the current session. Each of the data sets consists of data from a total of 2848
sessions completed by 32 players and represents a subset of the workload data presented in the previous
phase. Table 5.4 presents an overview of several differences between the data sets generated for modelling
and the data set presented in the section on data understanding.

Statistic ML Data Set Data Warehouse


Number of sessions 2848 3005
Number of players 32 38
Number of NC injuries 34 45
Number of NCTL injuries 13 14

Table 5.4: Comparison of ML data set and data from data warehouse

From the table above, it can be seen that multiple tuples are excluded in the process of creating the data
set to be used for modelling. Among the data excluded are workload data from 6 of the 38 players. Data
from these players is excluded because they have an insufficient number of recordings. Additionally, all
recorded sessions from January are excluded because of the extensive time gap between the last session in
January and the first session in February. As mentioned in the section on MSWR workloads, all sessions
within the first six days since a player’s first recorded session are excluded from the final data set. The
above action is performed to remove MSWR values of zero.
Furthermore, it can be seen that neither NC nor NCTL data sets include all injuries from the data warehouse.
In the case of the NC data set, injuries are excluded if they took place very early in the season, or if an
injury is associated with a player that is not included in the final data set. The NCTL data set loses one
case of injury as a result of the exclusion of a player with insufficient workload data.
The final data set generated in the data preparation phase is thus a matrix of 2848 vectors, of which each
vector contains 18 workload features, an injury feature, and a label. A visual representation of the data set
is presented in Figure 5.5.

 
$$
\begin{pmatrix}
WF_{1,1} & WF_{1,2} & \cdots & WF_{1,18} & IF_{1,19} & L_{1,20} \\
WF_{2,1} & WF_{2,2} & \cdots & WF_{2,18} & IF_{2,19} & L_{2,20} \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
WF_{2848,1} & WF_{2848,2} & \cdots & WF_{2848,18} & IF_{2848,19} & L_{2848,20}
\end{pmatrix}
$$

Figure 5.5: Representation of the final data set constructed in the data preparation phase

5.3.5 Correlation Matrix for the Features in the Final Data Set

Correlation is a measure of the extent to which two variables change together. A correlation matrix for the
final data set is presented in Figure 5.6 and provides an indication of the degree of correlation between
each pair of variables in the data set.
Several interesting relationships are seen with the aid of the figure. Firstly, there is a weak correlation
between the injury feature and all of the workload features. The feature is thus considered independent
from all other features and may prove useful as a predictor variable in the modelling phase. Secondly,
there are 16 highly correlated relationships (> 0.80) between variables, meaning that if highly correlated
features were to be removed before modelling, this would result in the removal of eight features from
the data set. One set of highly correlated features is that of total distance, player load, and acceleration
load. High correlation values are also observed for the EWMA and MSWR values of these features. These
relationships are not surprising as both player load and acceleration load values are expected to increase
with an increase in the total distance covered by a player.
Another set of highly correlated features is that of HSR distance and V4 distance. This is also not surprising
given that HSR distance is the sum of V4 and V5 distances. The fact that V4 distance has a stronger
correlation to HSR distance than that of V5 distance indicates that V4 distance forms a more significant
portion of the HSR distance than V5 distance does.

Figure 5.6: Correlation matrix of the injury and workload features in the final data set

5.4 Modelling

This section provides a detailed account of the entire modelling approach taken in this thesis and includes
a description of the models and techniques which are used. Four modelling approaches are applied to
both NC and NCTL data sets. Each of these approaches uses four classification algorithms of varying
complexity and interpretability, to build predictive models for each of the two data sets. The section
begins with a brief presentation of the four classification models, before providing an explanation of the
techniques used to account for unbalanced data, overfitting, and multicollinearity. Several key concepts
which are reproduced from a previous study are then explained. Finally, a presentation of each of the four
modelling approaches is provided.

5.4.1 Model Selection

• Decision Tree (DT) classification is a non-parametric approach, in which models are fitted by
learning simple decision rules inferred from predictor variables. In so doing, the predictor space
is segmented into several simple regions. Prediction for a given observation thus involves taking
the most commonly occurring class of the training observations in the region to which it belongs
[22]. Decision Trees are considered an ideal approach for gaining insight into injury and training
workloads, due to the relative ease with which they may be interpreted. Additionally, their accuracy is
not adversely affected by the presence of redundant attributes, a phenomenon that is likely to be
encountered among workload features [44]. Possible drawbacks include the classifier’s susceptibility
to lower prediction accuracy when compared with other classification techniques [22]. DT classifiers
are also susceptible to the inclusion of irrelevant attributes in the tree building process [44].

• Random Forests (RF) are an ensemble learning method for classification which builds upon the
concept of decision trees in an effort to build more powerful prediction models. The algorithm
works by building several decision trees based on bootstrapped training samples. In contrast to
building classical decision trees, each time a split in a given tree is considered, a random sample
of predictors is chosen as split candidates. These candidates represent a subset of the full set of
available predictors, and a new subset of predictors is chosen for each split. This technique reduces
the tendency of a model to overfit to its training set by decorrelating trees. The average of the
resulting trees is thus less variable and more reliable [22].

• Support Vector Machines (SVM) separate between classes using a subset of predictor variables
known as support vectors. The basic principle involves the construction of a hyperplane, which
aims at maximising the margin between itself and the support vectors. Decision boundaries with
larger margins tend to have lower generalisation errors than those with small margins [44]. This
principle can be extended to accommodate non-linear decision boundaries by enlarging
the feature space to a higher dimensional space through the use of kernels [22]. Although support
vector machines have shown promising results in many practical applications, they can be ineffective
in determining class boundaries when dealing with unbalanced data sets [51].

• Logistic Regression (LR) lends itself well to the binary nature of the problem being modelled in
this thesis and is a common choice for modelling the outcome of injury [5]. By making use of a
logistic function, this algorithm produces an S-shaped prediction curve, which limits predictions
to a value between 0 and 1. Model fitting involves the process of estimating a set of regression


coefficients that maximise a given likelihood function. Once this has been done, injury predictions
are made by computing the probability of injury from a set of predictor variables [22].
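
To make the comparison concrete, the following sketch shows how the four classifiers might be instantiated with Scikit-learn, the library used for modelling in this thesis (see Appendix A). The hyperparameter values shown are illustrative assumptions, not the exact settings used for the experiments.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# The four classifiers of varying complexity and interpretability.
# Hyperparameters are placeholders, not the thesis settings.
models = {
    "DT": DecisionTreeClassifier(max_depth=5),
    "RF": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf", probability=True),
    "LR": LogisticRegression(max_iter=1000),
}
# All four expose the same fit/predict interface, so they can be run through
# identical training and evaluation loops.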

5.4.2 Adaptive Synthetic Sampling

Adaptive synthetic sampling (ADASYN) is used before building models to correct the severe class
imbalance, which is present in both NC and NCTL data sets. At a high level, ADASYN corrects class
imbalance by generating synthetic examples of the minority class. The sampling technique facilitates
learning from imbalanced data sets by reducing bias and learning adaptively. A vital concept of the
technique involves the use of a density distribution. The distribution is used for determining the number of
synthetic samples that need to be generated for each example of the minority class. This
particular strategy ensures that more synthetic data is generated for the minority class in neighbourhoods
dominated by the majority class. Thus, the resulting data set is balanced, and learning algorithms are
forced to focus on harder-to-learn examples [18]. One potential weakness that may result from the adaptive
nature of this algorithm is that neighbourhoods with few examples of the minority class may end up with a
large amount of synthetic data that closely resembles the majority class. This can reduce the precision of
learning models by producing too many false positives.
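
As an illustration of how this resampling step can be applied in practice, the following sketch uses the ADASYN implementation from the imbalanced-learn library; X_train and y_train are placeholders for the workload features and binary injury labels of a training set.

from collections import Counter
from imblearn.over_sampling import ADASYN

# Adaptively generate synthetic examples of the injury (minority) class.
ada = ADASYN(random_state=42)
X_resampled, y_resampled = ada.fit_resample(X_train, y_train)

print(Counter(y_train))       # severely imbalanced before resampling
print(Counter(y_resampled))   # approximately balanced after resampling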

5.4.3 Stratified K -Fold Cross Validation

Cross-validation is a resampling method that can be used to assess a model’s generalisation ability on
a limited set of data. In k-fold cross-validation, the training data is divided into k disjoint subsets of
approximately equal size. The model is fitted on k − 1 of the subsets, which together form the training set.
The remaining subset is considered the validation set and is used to assess the model’s performance. This
procedure is repeated k times until each subset has served as the validation set. A model's cross-validated
performance is calculated as the mean of the k measures [3]. Stratified cross validation is particularly
important when dealing with unbalanced data sets, as it ensures that each class is approximately equally
represented across all subsets. Without this guarantee, there is a risk of one or more subsets not having
any examples of the minority class. K-fold cross-validation is typically performed with k = 5 or k = 10,
as there is empirical evidence that these values are associated with good bias-variance trade-offs [22].
Furthermore, these values avoid expensive computational costs which may result from high values of k.
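
A minimal sketch of this procedure using Scikit-learn is given below; X and y are placeholders for the feature matrix and injury labels, and model stands for any of the four classifiers described above.

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in skf.split(X, y):
    # Fit on k-1 folds and validate on the held-out fold; stratification keeps
    # the injury/non-injury ratio approximately equal in every fold.
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The cross-validated performance is the mean of the k fold scores.
print(np.mean(fold_scores))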

5.4.4 Feature Selection

Feature selection refers to the process of eliminating predictor variables from a data set which do not
contribute to a model’s predictive performance. The motivation for reducing the number of features in
a data set typically falls into one of two categories. The first of these is aimed at improving a model's
performance by eliminating predictor variables which may negatively affect it.
Support vector machines are sensitive to irrelevant predictors, and a model's performance may suffer if
they are not removed. A further consideration is that of highly correlated predictors, which can negatively
impact models such as logistic regression. A second reason for the removal of predictor variables is
motivated by the aim of reducing a model's complexity. Both of the above-mentioned motivating factors are
highly relevant to this thesis. More specifically, the elimination of irrelevant or highly correlated variables
is desirable for the reason that it may improve predictive performance for the logistic regression and SVM


models. Alternatively, eliminating predictors whose removal does not compromise a model's performance is
highly desirable because the resulting reduction in model complexity may significantly improve its
interpretability. This is particularly relevant for random forest and decision-tree models [29].
Feature selection techniques can be grouped into three classes, namely, intrinsic methods, filter methods, and
wrapper methods. For intrinsic methods, feature selection forms a part of the modelling process, and there
is a direct connection between selecting features and the statistic which the model attempts to optimise. In
filter methods, a single supervised search is performed to determine which predictors are of importance.
Wrapper methods take an iterative approach by providing a predictor subset to the model and using its
evaluation as the selection criteria for the next subset [29].
This thesis uses a wrapper method known as recursive feature elimination (RFE) to remove potentially
irrelevant features from the injury/workload data set. RFE is an example of a greedy wrapper that
eliminates features in each iteration to achieve the best immediate results. The process begins by ranking
each predictor with a measure of importance. Each successive round removes one or more features of low
importance before performing the ranking procedure on the new subset of features. Upon termination of
the process, a ranking of importance for all prediction variables has been calculated. The subset used for
model building is thus a selection of the predictors with the highest importance ratings [29].
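
The following sketch shows how RFE can be performed with Scikit-learn, using a decision tree to rank predictor importance; the number of features retained and the variable names feature_names, X and y are illustrative assumptions.

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rfe = RFE(estimator=DecisionTreeClassifier(),
          n_features_to_select=5,   # assumed size of the final predictor subset
          step=1)                   # remove one low-importance feature per iteration
rfe.fit(X, y)

# support_ marks the retained predictors; ranking_ gives the full importance
# ranking (1 = selected, larger values = eliminated earlier).
selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print(selected)
print(rfe.ranking_)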

5.4.5 Features From Previous Work

As already mentioned, this thesis is largely inspired by previous work, which was able to predict the onset
of injury among football players with reasonable success [39]. The results from the work mentioned above
are achieved using data gathered from a single professional football team during one season of training
and competition. A subset of only three features was identified as relevant for the prediction of injury. The
features identified were the exponentially weighted moving average of a player's high-speed running distance
(HSR_EWMA), the mean standard deviation ratio of a player's total distance (Dist_MSWR), and the exponentially
weighted moving average of a previous injury feature (PI_EWMA). Part of this thesis uses a combination of
techniques and features from the work mentioned above, to test their reproducibility on a new set of data.
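
For illustration, an EWMA-style workload feature of this kind could be derived from per-session data along the following lines; the pandas span parameter, the column names, and the per-player grouping are assumptions made for this sketch and not the exact parameters used in the referenced study.

import pandas as pd

sessions = pd.DataFrame({
    "player_id":    [1, 1, 1, 1],
    "hsr_distance": [310.0, 420.0, 0.0, 515.0],   # example HSR distances per session
})

# Exponentially weighted moving average of HSR distance, computed per player
# over that player's session history.
sessions["hsr_ewma"] = (sessions.groupby("player_id")["hsr_distance"]
                                .transform(lambda s: s.ewm(span=6, adjust=False).mean()))
print(sessions)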
There are, however, several important differences between the data set presented in this study and the data
set from the study mentioned above. The differences noted are as follows:

• The previous study used an NCTL definition of injury, whereas this study takes into consideration
both NCTL and NC injury definitions.

• The previous study reported 21 incidents of NCTL injury, whereas this study has only 13.

• The previous study reports 931 individual sessions, whereas this study includes 2848 individual
sessions.

5.4.6 Modelling Approach

Four modelling approaches are taken to compare the effects of the data processing techniques discussed
above. All four classifiers and both NC and NCTL data sets are used in each of the four approaches.
Starting with a simple naive approach, each successive approach incorporates additional techniques in an
attempt to improve model performance and gain insight into the effects of a feature’s contribution to the
incidence of injury. Each approach involves building models using one portion of the data set and testing


them on the remaining portion. Data sets are stratified to ensure the correct representation of the injury
class in both training and testing data sets. Additionally, all approaches are subjected to 1000 iterations
of training and testing, and results are reported as an average of the 1000 performance measurements.
Following is a description of the different approaches.

Naive Approach
The first of the four approaches involves building and testing models using each of the two data sets
without modifying the data. In each iteration, the data is split into a test and a training set. Models are then
built using the training data, after which they are evaluated on the designated test data. The splitting ratio
differs for the two data sets to ensure that test sets for both NC and NCTL data sets contain approximately
the same number of test examples. A train-test ratio of 80:20 is used for the NCTL data, whereas a ratio of
55:45 is used for the NC data. These ratios ensure approximately six to seven test cases for both sets of
data. An illustration of the test-train split for the NC data is presented in Figure 5.7.

Figure 5.7: An illustration of a test/train split
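
A minimal sketch of these stratified splits, assuming Scikit-learn and placeholder variable names, is shown below; the split ratios follow the text above.

from sklearn.model_selection import train_test_split

# NCTL data: 80:20 train-test split, stratified on the injury label.
X_train_nctl, X_test_nctl, y_train_nctl, y_test_nctl = train_test_split(
    X_nctl, y_nctl, test_size=0.20, stratify=y_nctl)

# NC data: 55:45 train-test split, chosen so that both test sets contain a
# similar number of injury examples.
X_train_nc, X_test_nc, y_train_nc, y_test_nc = train_test_split(
    X_nc, y_nc, test_size=0.45, stratify=y_nc)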

ADASYN and Stratified K-Fold Cross Validation


The second approach aims to compensate for the class imbalance by generating synthetic data. Additionally,
stratified k-fold cross validation is used in an attempt to avoid models overfitting to the training set. The
approach is presented in Figure 5.8, which illustrates how ADASYN and stratified 5-fold cross validation
are used in combination with one another. From the figure, it can be seen that the initial data set consists
of 2814 sessions, which are not associated with injury and 34 sessions, which are associated with an injury.
The data is then split into five folds of approximately equal injury to non-injury ratios. In the next step, one
fold is selected as a test set, while the remaining folds are used together as the training set. Before training
is initiated, ADASYN is used to balance the training set by generating synthetic data for the injury class.
The resulting training set thus has an approximately equal number of examples of injury and non-injury.
Once training is completed, a model’s performance is tested using the test fold. The step involving the
selection of a testing set and the generation of synthetic data is repeated five times until all of the five folds
have served as the test set. Performance measurements (discussed in the evaluation phase) are calculated
as an average of the measurements from the five steps.
The entire process is repeated 1000 times for each of the four models and both of the two data sets. To
maintain a similar number of injury test cases among all test sets, the value of k is set to five for NC data
and two for NCTL data.
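
The core of this approach can be sketched as follows, assuming Scikit-learn and imbalanced-learn, with X, y and model as placeholders; note that ADASYN is applied to the training folds only, so the test fold keeps its original class distribution.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import ADASYN

skf = StratifiedKFold(n_splits=5, shuffle=True)   # k = 5 for NC data, k = 2 for NCTL data
aucs = []

for train_idx, test_idx in skf.split(X, y):
    # Oversample the injury class in the training folds only.
    X_bal, y_bal = ADASYN().fit_resample(X[train_idx], y[train_idx])
    model.fit(X_bal, y_bal)
    probs = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], probs))

# Performance is reported as the average over the k test folds (and, in the
# thesis, further averaged over 1000 repetitions of the whole procedure).
print(np.mean(aucs))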


Figure 5.8: An illustration of 5-Fold Cross Validation with ADASYN


Previously Identified Features


The third approach is almost identical to that of the second, with the exception being the choice of predictor
variables included in the data set. A subset consisting of the predictor variables presented in Section 5.4.5
is used to determine whether they influence a model's ability to detect injury.

Feature Elimination
The final approach combines the second approach described above with the process of feature elimination.
This is essentially a replication of the entire modelling approach taken in the study presented in Section
5.4.5. There are two objectives to this approach. The first of these is aimed at improving a model’s
performance by removing features that may negatively affect the model’s ability to learn. The other
involves identifying a combination of features that are associated with an injury.
Due to there being too few cases of injury in the NCTL data set, the feature elimination approach is
carried out using only the NC data set. As with the approaches presented above, performance measures are
reported as an average of 1000 iterations. The entire approach is illustrated in Figure 5.9 and summarised
as follows:

1. Step 1 involves splitting the data set. A 30:70 split ratio is used, with 30% of the data being used
for the first 3 steps, while 70% is reserved for constructing and testing prediction models.

2. In Step 2, ADASYN is used to correct the class imbalance by generating synthetic data for the
minority class. In the case of the NC injury data, this results in approximately 844 examples
represented for each class.

3. Feature selection is then applied in Step 3 using RFECV, and the predictor variables most relevant
for classification are recorded. RFECV is performed using a Decision Tree classifier, the number of
folds is set to k = 5, and the performance measure is specified as AUC ROC (a sketch of this step is given after this list).

4. Step 4 marks the beginning of the second phase of the modelling approach and involves building
and testing models using the second portion of data from the split in Step 1.

5. In Step 5, predictor variables identified as irrelevant in Step 3 are removed from the data set.

6. In Step 6, the filtered data is used to perform stratified k-fold cross validation using ADASYN to
compensate for class imbalance in the training set. The process is identical to that illustrated in
Figure 5.8, with the exception being that k = 3 is used.
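
A minimal sketch of the RFECV step (Step 3), assuming Scikit-learn, is shown below; X_select and y_select stand for the 30% portion of the data set after ADASYN resampling.

from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Recursive feature elimination with 5-fold cross-validation, scored on ROC AUC,
# using a decision tree to rank feature importance.
rfecv = RFECV(estimator=DecisionTreeClassifier(),
              step=1,
              cv=StratifiedKFold(n_splits=5),
              scoring="roc_auc")
rfecv.fit(X_select, y_select)

print(rfecv.n_features_)   # number of features considered optimal in this run
print(rfecv.support_)      # boolean mask over the predictor variables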


Figure 5.9: An illustration of Feature Elimination combined with K-Fold Cross Validation and ADASYN

5.5 Evaluation

5.5.1 Evaluation Metrics

The use of common performance measures such as accuracy or error rate may produce misleading results
when dealing with unbalanced data sets due to their dependence upon class distribution [19]. This can be
illustrated using the example of the NC injury data set. As the injury class represents approximately 1% of
the data, a classifier may miss every case of injury and still achieve an accuracy of 99%. An additional
issue that is associated with common performance measurements is that of the cost of misclassification.
As it is often more important to be able to identify the minority class than the majority class, the cost
of misidentifying the minority class will have a more significant consequence [10]. This is of particular
importance for this study, which places more emphasis on a classifier being able to identify cases of injury
correctly. It is for this reason that the performance measures used in this thesis are addressed towards
quantities that are class independent.
As there appears to be no consensus among previous studies as to which performance measures should
be used [39; 5; 23], this thesis makes use of a variety of widely used evaluation techniques. Table 5.5
presents a confusion matrix that forms the basis of the majority of measures used in this study. In binary


classification problems, predictions fall into one of two categories, either positive or negative. This gives
rise to four possible outcomes, namely, True Positives, False Positives, True Negatives, and False Negatives.
The values of these outcomes are used for calculating the measures presented below.

                   Predicted Positive    Predicted Negative
Actual Positive    TP                    FN
Actual Negative    FP                    TN

Table 5.5: A confusion matrix for binary classification

Precision represents the fraction of cases classified as positive that are, in fact, positive [33]. It serves as
an indication of a model's trustworthiness. In the case of injury prediction, a high precision would result
in the model identifying injuries with a great deal of certainty. A low precision, however, would result in
the model raising a large number of false alarms.

Precision = TruePositives / (TruePositives + FalsePositives)        (5.6)

Recall is an indication of a model’s ability to identify the injury class. The higher the recall, the better
a model’s ability to detect the injury class. A low recall would result in an inability to predict a player
getting injured.

Recall = TruePositives / (TruePositives + FalseNegatives)        (5.7)

The F1 Score is a harmonic mean between recall and precision, and combines the two into one measure.

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)        (5.8)

Area under the curve (AUC): The Receiver Operating Characteristic (ROC) curve is a frequently used tool
for evaluating the performance of a classifier in the presence of unbalanced data. It is a graphical plot of
a classifier's true positive rate (recall) versus its false positive rate at all classification thresholds. Better
performance is associated with steeper curves, whereas a model's inability to differentiate between classes
would result in a diagonal curve from the bottom left to the top right of the plot [33]. The area under the
curve (AUC) provides an aggregate measure of performance, which measures the entire two-dimensional
area under the ROC curve. Scores close to 1 are associated with good performance, whereas scores around
0.5 are equivalent to random guessing.
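
As an illustration, the four metrics can be computed with Scikit-learn as sketched below; y_true, y_pred and y_prob are placeholders for the true labels, predicted labels and predicted injury probabilities of a test set.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_true, y_pred, zero_division=0)
recall    = recall_score(y_true, y_pred, zero_division=0)
f1        = f1_score(y_true, y_pred, zero_division=0)
auc       = roc_auc_score(y_true, y_prob)   # AUC is computed from predicted probabilities

print(precision, recall, f1, auc)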


5.5.2 Results

A summary of all results is presented in Table 5.6.

Approach                     Classifier  Data Set  Precision    Recall       F1           AUC

Naive Approach               SVM         NC        0.00         0.00         0.00         0.50
                                         NCTL      0.00         0.00         0.00         0.50
                             LR          NC        0.00         0.00         0.00         0.50
                                         NCTL      0.00         0.00         0.00         0.50
                             RF          NC        0.00         0.00         0.00         0.50
                                         NCTL      0.00         0.00         0.00         0.50
                             DT          NC        0.02 ± 0.05  0.01 ± 0.04  0.01 ± 0.04  0.50 ± 0.02
                                         NCTL      0.01 ± 0.03  0.00 ± 0.03  0.01 ± 0.03  0.50 ± 0.01

ADASYN &                     SVM         NC        0.52 ± 0.27  0.02 ± 0.01  0.03 ± 0.02  0.50 ± 0.01
Stratified K-Fold CV                     NCTL      0.08 ± 0.08  0.00 ± 0.00  0.00 ± 0.00  0.50 ± 0.00
                             LR          NC        0.44 ± 0.24  0.02 ± 0.01  0.03 ± 0.02  0.50 ± 0.01
                                         NCTL      0.38 ± 0.05  0.01 ± 0.00  0.02 ± 0.01  0.50 ± 0.00
                             RF          NC        0.06 ± 0.09  0.02 ± 0.03  0.03 ± 0.04  0.50 ± 0.01
                                         NCTL      0.04 ± 0.01  0.02 ± 0.03  0.03 ± 0.04  0.51 ± 0.01
                             DT          NC        0.10 ± 0.02  0.02 ± 0.01  0.03 ± 0.02  0.50 ± 0.01
                                         NCTL      0.08 ± 0.09  0.02 ± 0.02  0.02 ± 0.03  0.51 ± 0.01

Previously Identified        SVM         NC        0.45 ± 0.24  0.01 ± 0.01  0.02 ± 0.01  0.50 ± 0.01
Features                                 NCTL      0.45 ± 0.12  0.00         0.01         0.50
                             LR          NC        0.33 ± 0.18  0.01 ± 0.01  0.02 ± 0.01  0.50 ± 0.01
                                         NCTL      0.46 ± 0.04  0.01         0.01         0.50
                             RF          NC        0.15 ± 0.07  0.02 ± 0.01  0.02 ± 0.01  0.50 ± 0.00
                                         NCTL      0.17 ± 0.08  0.01 ± 0.01  0.01 ± 0.01  0.50
                             DT          NC        0.17 ± 0.08  0.02 ± 0.01  0.03 ± 0.03  0.50 ± 0.01
                                         NCTL      0.19 ± 0.10  0.01 ± 0.01  0.02 ± 0.01  0.50

Feature Selection &          SVM         NC        0.65 ± 0.17  0.02 ± 0.01  0.04 ± 0.01  0.51 ± 0.01
ADASYN &                                 NCTL      –            –            –            –
Stratified K-Fold CV         LR          NC        0.48 ± 0.10  0.02 ± 0.01  0.04 ± 0.02  0.51
                                         NCTL      –            –            –            –
                             RF          NC        0.06 ± 0.08  0.02 ± 0.02  0.03 ± 0.04  0.50 ± 0.01
                                         NCTL      –            –            –            –
                             DT          NC        0.08 ± 0.13  0.01 ± 0.01  0.02 ± 0.02  0.50 ± 0.01
                                         NCTL      –            –            –            –

Table 5.6: Summary results of all models


An explanation of the results seen in Table 5.6 is provided below.

Naive Approach
It can be seen from Table 5.6 that by not employing techniques to account for class imbalance and
collinearity present among the data sets, the performance achieved by all models is extremely poor. Apart
from decision trees, all of the models achieved precision, recall, and F1 scores of 0.00 for both definitions
of injury. DT models are the only models able to produce prediction scores above zero, which appear
to be marginally higher for the case of NC injuries when compared to NCTL injuries. The performance
measures for both data sets are associated with a high degree of variation. AUC scores of 0.50 are achieved
for all models, which indicates that models are unable to distinguish between injury and non-injury classes.
Prediction of injury is thus considered equivalent to guessing.
Table 5.9 presents the confusion matrices for SVM and DT models after being tested on NC data. These
are identified as the poorest and the best performers, respectively. Here it is evident that the SVM models
create decision boundaries which classify all test data as belonging to the non-injury class. Hence, no
predictions of the injury class are made. The decision boundaries created by the DT models, however, do
result in positive predictions with a very low degree of precision. Here it can be seen that only 117 of 8797
predictions of injury are, in fact, true cases of injury.

Table 5.7: Confusion matrix for SVM

                   Predicted Positive    Predicted Negative
Actual Positive    0                     7000
Actual Negative    0                     563000

Table 5.8: Confusion matrix for DT

                   Predicted Positive    Predicted Negative
Actual Positive    117                   6883
Actual Negative    8680                  554320

Table 5.9: A comparison of confusion matrices for SVM and DT models using NC data

ADASYN and Stratified K-Fold Cross Validation


The introduction of cross validation and synthetic data generation produces higher precision, recall, and F1
scores for all models and both data sets. The most notable performance improvement is observed among
precision scores. SVM models are able to achieve the highest average precision score of 0.52 for NC
injuries, a drastic improvement from the average precision score achieved using the naive approach. The
SVM models are, however, unable to reproduce these results for NCTL injuries, for which an average
precision score of 0.08 is achieved. All models achieve better precision results in the case of NC injuries,
and hence fewer false alarms are raised for this definition of injury.
Recall scores are low for all models. Low recall scores are the result of a high proportion of false negatives,
meaning that models are unable to identify incidents of injury. These lower values of recall are responsible
for the poor F1 scores seen for all models. It could be said that F1 scores are marginally better in the case
of NC injuries, but the values are so low that they are of no significant importance.


Figure 5.10: Boxplots comparing the area under the curve for (a) NC and (b) NCTL models using the
ADASYN and Stratified K-Fold approach

A comparison of AUC scores for NC and NCTL injuries is presented in Figure 5.10 using boxplots. Here
it can be seen that there is minimal variation between scores for both NC and NCTL injuries. An average
AUC score of 0.50 indicates that none of the models are able to distinguish between classes, despite
attempts being made to compensate for imbalance and overfitting.

Previously Identified Features


The reduction of the feature set does not result in improved performance, with the exception of several
results seen among precision scores. In particular, models built using the NCTL data set show an
improvement in precision scores for all models. Despite this improvement, none of the precision scores
exceed 0.50, and they are by no means trustworthy indicators of injury. Furthermore, recall scores remain low,
highlighting the models’ inability to recognise incidents of injury.
Figure 5.11 presents a comparison of AUC scores for models tested on both the NC and NCTL data sets.
Near-perfect alignment of AUC scores around a value of 0.50 illustrates that models are unable to learn
from the data. When compared to the boxplots presented in Figure 5.10, it can be seen that the subset of
previously identified features does not significantly influence the median AUC scores. The only noticeable
difference is that of the variance, which is slightly higher for the full data set when compared to the reduced
data set. The difference in variance may be explained by more complex decision boundaries which result
from a larger feature set.


Figure 5.11: Boxplots comparing the area under the curve for (a) NC and (b) NCTL models using the
previously identified injury features

Feature Selection
The introduction of RFECV produces marginally better results for both the SVM and LR models, whereas
no improvement in performance is seen for the RF and DT models. By eliminating "irrelevant" features,
the fourth modelling approach taken in this thesis results in the highest precision and F1 scores for both
SVM and LR models. Despite these slight improvements, the approach is unable to produce results of
any significance, and recall and AUC scores remain very low. A comparison of the AUC scores achieved
using the feature elimination modelling approach is presented in Figure 5.12. As seen with all modelling
approaches used in this thesis, low scores of around 0.50 are achieved by all models. This illustrates the
models’ inability to learn from the data, despite the elimination of "irrelevant" features.

Figure 5.12: Comparison of AUC scores for models using RFECV in combination with ADASYN and
Stratified K-Fold CV


Figure 5.13 presents eight graphs taken from separate runs of the RFECV algorithm, where each graph
plots the number of features against a DT classifier’s prediction performance. The graphs are taken from
separate iterations of the feature elimination approach and highlight how the number of features required
to achieve maximum performance varies from one iteration to another. The figure presents graphs in which
four to eleven features are considered optimal for maximising a classifier’s performance. The high degree
of variation can be explained by the similarity in the shape of the graphs. Here it is seen that classification
performance is initially poor when the number of features is small. The performance rapidly improves
as the number of features increases, after which it plateaus once approximately four to five features have
been reached. The flattening of the curves represents the algorithm’s inability to identify a single subset of
features that can be considered optimum, hence the significant variation in the number of features selected.
Ideally, a decrease in performance should be seen after the optimal number of features has been reached,
resulting in a graph with a clear peak and the curve falling away on both sides.
From 1000 iterations of RFECV, three features are common to all sets of selected features. These are
the V5 distance of a player, the mean standard deviation ratio of a player’s V4 distance, and a player’s
previous injury feature. Despite these three features being common to all feature sets, they are never
selected without the inclusion of additional features. The additional features, however, differ, which may
be caused by several factors. The first of these may be attributed to the high degree of correlation among
many of the features. A high correlation between features may result in a feature being selected in one
iteration, but being excluded in another iteration as a result of another highly correlated feature being
selected instead. Another factor that influences the variability of the number of features selected is the
fact that only 30% of the data set is used for RFECV. As a result of the combination of a small number
of injuries and a low degree of correlation between the workload features and injuries, it is logical that
the features will differ from one iteration to another. The last factor involves the use of a decision tree
classifier in the RFECV process. Decision trees are susceptible to overfitting, which may also contribute to
the high degree of variability between iterations. An alternative option would be to use a random forest
classifier which is less susceptible to overfitting.

5.5.3 Discussion of Results

Models of the relationship between GPS workloads, a previous injury feature, and injury show limited
ability to predict future injury among players from a professional Norwegian football team. Mean
AUC scores are below 0.52 for all modelling approaches, indicating that injury predictions are no better
than those expected by random chance. Precision scores are higher than recall scores for all modelling
approaches, with a highest score of 0.65 ± 0.17 being achieved by the feature selection approach. The
inability to achieve recall scores higher than 0.02 means that all modelling approaches result in a large
number of false-negative predictions, indicating that models are unable to identify the injury class. The
use of different definitions of injury does not improve model performance and supports evidence from a
similar study on Australian football players [5]. The naive approach adopted in this study produced the
worst results, indicating that the techniques adopted to compensate for unbalanced data and irrelevant
features improve performance. Despite the use of previously identified features showing no improvement
in predictive performance, 1000 iterations of feature elimination result in a subset of three features being
selected in every iteration. One of these features, a previous injury feature, is identical to that identified in
previous work and suggests that it may be a contributing factor to the risk of injury [39].


Figure 5.13: Comparison of feature selection; panels (a) to (h) show separate runs in which four to eleven
features are considered optimal


Several limitations that may negatively impact the prediction models’ performance are identified. The
amount of data collected in this study is significantly less than that collected in many other studies of a
similar nature, which typically include a much larger number of players, multiple seasons of data recording,
or a combination of both [40; 5; 23; 45; 38]. The limited size of the data set may severely impair a
model's ability to generalise cases of injury, and models run the risk of overfitting to the training set. It
has been proposed that more than ten seasons of data is needed to create reliable prediction models [5].
Two strategies can be used to increase the size of the data set. One of these involves the inclusion of
additional teams and may be achieved by the sharing of data between football clubs. An advantage of
this strategy is that a wide variety of players and workloads will improve a model's generalisation ability.
Additionally, a large amount of data can be collected over a shorter time. The complexity of coordinating
a larger scale project is, however, a distinct disadvantage of the above strategy. Another strategy is to
continue collecting data during future seasons. The strategies mentioned above are not mutually exclusive
and may be implemented together.
In addition to the limited amount of data, this study has a very low number of injuries with respect to
the size of the data set. Using the example of NCTL injuries, similar studies report incidents of injury
accounting for 1.7 and 2.2 percent of the data set [5; 39]. In contrast, NCTL injuries account for only
0.5 percent of the entire data set in this study. Severe class imbalance negatively affects the quality of
data generated in the process of synthetic data generation, which again may impair a model's ability to
generalise. There is little that can be done about the low number of injuries. However, injuries that are
excluded from the data set due to being classified as pre-season injuries, may be incorporated in future
studies if off-season training loads are to be recorded. Additionally, by incorporating more clubs into the
study, and by continuing to collect workload and injury data over multiple seasons, the number of injuries
will naturally increase.
Another possible limitation encountered in this study is that of the definition of injury. All NC and NCTL
injuries which occur in-season are included among the injury data regardless of whether they took place
during a session or not. As several incidents of injury occurred outside of training/competition, they may
not be directly associated with previous workloads, and for this reason, negatively affect a model's ability
to differentiate between classes of injury and non-injury. Similarly, more specific definitions of injury, such
as hamstring injury, have been shown to produce better prediction performance [5]. Hamstring injury is,
however, not used in this study due to an insufficient number of cases. Again the collection of additional
data using the strategies mentioned above may provide the data necessary to include hamstring injuries in
future studies.
Finally, GPS data is also considered a possible limitation in this study. As discussed in the data
understanding phase, multiple GPS recordings fall outside of the range of what is expected. This highlights the
fact that GPS devices are susceptible to providing inaccurate workload recordings. Despite efforts being
made to correct the more apparent outliers, there is a high probability that several more subtle recording
inaccuracies remain unchanged. An obvious solution to the problem is to replace the GPS units with
newer models. This is, however, costly and may not be a viable alternative. Another option is to monitor
GPS recordings more frequently as opposed to at the end of the season after the ETL process has been
performed. More frequent checking by coaching staff would provide researchers with a clearer indication
of which recordings need to be corrected, thus limiting potential errors.

Chapter 6

Conclusion

Until recently, the majority of research on relationships between athlete workloads and injury has been
limited to univariate studies. However, the emergence of machine learning has given rise to numerous
multivariate studies. The work covered in this thesis forms part of a partnership between the University of
Oslo (UIO) and the Norwegian School of Sports Science (NIH), which aims to study relationships between
athlete workloads, injury and illness. With this thesis serving as the first work undertaken in the partnership,
two goals are identified. The first is to build a data warehouse to store all available training, competition,
injury and illness data generated by a professional Norwegian football team. The data warehouse serves as
a unified representation of the club's data, providing data for both the second goal of this thesis, as well as
future research. The second goal is to conduct a data mining study using player workload and injury data
to predict future injury.
This chapter begins in Section 6.1 with a summary of the data warehouse implemented in this thesis. It
then proceeds to Section 6.2, which provides a summary of the data mining study and a brief discussion of
model performance. Lastly, Section 6.3 provides direction for the future work of the project.

6.1 Data Warehouse Summary

The choice of a data warehouse is motivated by three factors. These are the user needs, the granularity of
the source data, and the update frequency of the datastore. Data warehouses specialise in handling the
types of query intended for this project, which are expected to be relatively complex and access a large
number of tuples. A data warehouse is also well suited to working with multiple levels of granularity using
OLAP functions such as roll-up and drill-down. The GPS data for this project is exported at a granularity
of one-second intervals, providing the opportunity for multiple levels of aggregation, thus making a data
warehouse a logical alternative for its storage. Update policies for data warehouses are typically less
frequent than those of traditional database management systems. The data for this project is made available
from CSV file exports, a process which is carried out infrequently due to its time-consuming nature.
Data updates are thus infrequent, making the data well suited for a data warehouse.
The three data sources available to this project are GPS workload data, a training log completed by
players after each session, and an injury/illness log completed by the club’s medical staff. Workload data
includes four distance features, namely total distance, V4 distance, V5 distance, and HSR distance, and two
acceleration features, namely acceleration load, and player load. Injury and illness data contains important


information such as the date and type of injury/illness incurred by a player, as well as information about its
severity, mechanism, duration, and the number of days a player is absent from training/competition. Log
data summarises the difference between a player’s planned and actual workloads for each session, as well
as the difference between a player’s planned and actual enjoyment for each session.
Data warehouse design is divided into four phases using an analysis/source-driven approach. This
approach takes into account both the analysis needs of users as well as the data which are available from
the underlying source systems. The process begins with the requirements specification phase, in which
the data required to achieve project goals, as well as several data warehouse components are identified.
This is followed by the conceptual design phase, which uses information from the requirements phase
to generate a single conceptual schema to facilitate communication between designers and users. Two
modelling issues are identified in this phase. The first of these is many-to-many dimensions, which result
from the possibility of players being injured and ill at the same time. It is resolved by splitting injury and
illness into two separate dimensions. The other is that of unbalanced hierarchies, which are present in
both the Injury and Illness dimensions, and is resolved by using placeholders for missing levels in each of
the hierarchies. A star schema comprised of a single fact table and six dimensions is then generated in
the logical design phase using a ROLAP approach. The fact table includes a foreign key for each of the
six dimension tables, the six workload features represented as measures, a date-time stamp and a unique
identifier which serves as the table's primary key. The date-time stamp supports precise time calculations
at finer levels of granularity. It is included in the fact table as opposed to a separate date-time dimension
to prevent dimension tables from becoming excessively large. The time hierarchy is thus represented as
a combination of a Date dimension, a Session dimension and the date-time stamp in the fact table. This
provides users with the ability to aggregate workload data from the finest granularity at the second-level,
up to the level of an entire season. The other dimensions included in the schema are the Player, Training Log,
Injury and Illness dimensions.
In the physical design phase, B-Tree indexes are created for all primary keys as well as the session-id
and player-id columns in the fact table. B-tree indexes are used as an alternative to join indexes due to
there being no support for the latter in PostgreSQL. A materialised view is also created in this phase and
includes workload and injury data aggregated at the session-level. The data included in the materialised view
is selected specifically for the analysis needs of the data mining completed in this thesis and does not take
into consideration the analysis needs of future work.

6.2 Data Mining Summary

In the second half of this thesis, a data mining study is conducted using the CRISP-DM process model.
With the aim of predicting future injury from workload data, five phases from this framework are carried
out, namely business understanding, data understanding, data preparation, modelling, and evaluation. Four
objectives are identified in the business understanding phase, all of which aim to predict future injury in
professional football players using workload and injury data gathered from a single season of training and
competition. The first and primary objective is to predict future injury, while the three remaining objectives
investigate whether altering the definition of injury, using a subset of previously identified features, or
using a process of feature elimination improve a models prediction performance. Two definitions of injury
are provided in this phase, namely non-contact injuries (NC) and non-contact causing time loss (NCTL).
Furthermore, all data is aggregated at the session-level.


Exploratory data analysis (EDA) is performed in the data understanding phase using injury and workload
data from the materialised view created in the first half of this thesis. GPS recording errors are identified,
and 174 sessions are corrected using a smoothing by bin means approach. Additionally, several injuries
are identified as occurring before the start of the season and are considered irrelevant for the study. Player
workloads of injured and non-injured players are compared for both definitions of injury. The most notable
difference is observed between the HSR distances of non-injured players and players with NC injuries.
However, the differences are insignificant and do not provide any insight into possible relationships
between player workloads and player injury.
The data preparation phase involves the creation of the final data set to be used for predictive modelling. It
consists of 19 features and a label indicating whether a player received an injury after the current session.
The 19 features consist of the six GPS workload features, six exponentially weighted moving average (EWMA)
features, six mean standard deviation ratio (MSWR) features, and a previous injury feature which represents the
number of injuries a player has received to date as well as the number of days since returning to training
after recovering from their last injury. Two data sets are created, one for each definition of injury, each of
which contains data from 2848 sessions and 32 players. The NC and NCTL data sets contain 34 and 13
injury labels, respectively.
Four models of varying complexity and interpretability are used in the modelling phase. The models
included in this phase are decision trees (DT), random forests (RF), logistic regression (LR), and support
vector machines (SVM). Furthermore, each model is used in four modelling approaches which aim to
achieve the objectives identified in the business understanding phase. The first of these is a naive approach,
which adopts a classic train-test split approach. The second and third approaches make use of stratified
k-fold cross validation and synthetic data generation using ADASYN to improve generalisation and
account for the class imbalance present in the data set. The difference between the two approaches is that
approach two makes use of the entire data set, whereas approach three uses a subset of previously identified
features, namely the exponentially weighted moving average of a player's high-speed running distance
(HSR_EWMA), the mean standard deviation ratio of a player's total distance (Dist_MSWR), and the exponentially
weighted moving average of a previous injury feature (PI_EWMA). The fourth approach uses 30% of the data
set to eliminate features which do not contribute to a model's predictive performance. Recursive feature
elimination with cross validation (RFECV) is used
for this process. Irrelevant features are removed from the remaining data set before using ADASYN to
correct class imbalance and stratified k-fold cross validation to improve generalisation. Each approach is
subjected to 1000 iterations of training and testing. Performance metrics are reported as an average of
test results from all iterations.
Finally, the evaluation phase involves evaluating a model's predictive performance for each of the four
approaches. Four class-independent evaluation metrics are used to assess model performance, namely,
precision, recall, F1 score and area under the curve (AUC). Models of the relationship between GPS work-
loads, a previous injury feature, and injury show limited ability to predict future injury among players from
a professional Norwegian football team. Mean AUC scores are below 0.52 for all modelling approaches
indicating that injury predictions are no better than those expected by random chance. Precision scores are
higher than recall scores for all modelling approaches, with a highest score of 0.65 ± 0.17 being achieved
by the feature selection approach. The inability to achieve recall scores higher than 0.02 means that all
modelling approaches result in a large number of false-negative predictions, indicating that models are
unable to identify the injury class. The use of different definitions of injury does not improve model
performance. The previous injury feature is one of three features selected in every iteration of feature
elimination, suggesting a possible association with player injury.


6.3 Future Work

As mentioned above, this thesis serves as the first work undertaken in the partnership between UIO and
NIH, and numerous opportunities for the future work of the project are identified. The most important
element of future work is to gather more data, which is considered essential for achieving good results in
the field of injury prediction. This topic is discussed in Section 5.5.3, which highlights two approaches
for attaining more data. The first is to include data from additional clubs and has the advantage of
accumulating more data over a shorter period. Additionally, data from a large number of players may
improve the generalisation ability of predictive models. A potential drawback to this approach is that of
added complexity. Increasing the number of clubs is likely to result in heterogeneity among data sources,
something which needs to be taken into consideration in the ETL process. Another approach to gathering
data is to continue collecting data from the same club over multiple seasons. This approach is advantageous
in that the data warehouse and ETL processes are already in place, and no additional consideration to
new data formats is required. The disadvantage of this approach is that it will take many years before a
substantial amount of data is collected. The two approaches are, however, not mutually exclusive and may
be used together.
The data set used for predictive modelling in this thesis is limited to the inclusion of two relative workloads,
namely MSWR and EWMA. There is strong evidence that spikes in acute:chronic workload ratios (ACWR)
are associated with increases in team injury rates [20; 32]. ACWR workloads are not included in this
study due to their calculation requiring 30 days of accumulated workload data. For this reason, the first
30 days since a player's first session cannot be included in the data set, resulting in a drastic reduction
in the size of the data set used for modelling. ACWR workloads may, however, prove very useful for
future work if the data set is to increase in size. Furthermore, the inclusion of physical measures of fitness,
motor coordination, and neuromuscular measurements have shown promising results in the field of injury
prediction [38; 2].
The data warehouse provides several opportunities for improving query performance, simplifying queries
and automating ETL processes. The current ETL process is both complex and time-consuming. One of the
significant constraints with the current process is that Catapult GPS data is loaded from CSV files. The
generation of a CSV file is time-consuming and subject to human error. Additionally, CSV files need to
be cleaned and separated by player before their data can be loaded into the warehouse. An alternative
approach is to load data using an open API provided by Catapult Sports. Such an approach eliminates
time-consuming and complex tasks and enables secure batch loading of the data warehouse with the update frequency
being decided by users.
The introduction of injury definitions into the Injury dimension may serve to simplify queries further.
The process of deciding which injuries are to be included in a given definition is a manual task involving
careful inspection of each incident of injury. The lists produced from these tasks are used in SQL queries
which are used in the data mining chapter of this thesis. By adding columns to the Injury dimension, which
classify an injury according to definitions specific to analysis tasks, SQL queries can be simplified.
Another improvement which needs to be considered as the data warehouse increases in size is that of
physical modelling. This topic is briefly discussed in this thesis but requires additional work to ensure
optimum performance for larger amounts of data. Firstly, indexing techniques such as join indexes are not


supported by PostgreSQL but may be implemented manually by data warehouse designers. Additionally,
BRIN indexes may be considered for the Injury dimension. Further performance considerations include
partitioning and view maintenance.

Bibliography

[1] L. Anderson, T. Triplett-McBride, C. Foster, S. Doberstein, and G. Brice. Impact of training patterns
on incidence of illness and injury during a women’s collegiate basketball season. Journal of strength
and conditioning research / National Strength Conditioning Association, 17:734–8, 12 2003.

[2] F. Ayala, A. Lopez-Valenciano, J. Martín, M. De Ste Croix, F. Vera-Garcia, M. García-Vaquero,


I. Ruiz-Pérez, and G. Myer. A preventive model for hamstring injuries in professional soccer:
Learning algorithms. International Journal of Sports Medicine, 40, 03 2019.

[3] D. Berrar. Cross-Validation. 01 2018.

[4] M. S. Brink, C. Visscher, S. Arends, J. Zwerver, W. J. Post, and K. A. Lemmink. Monitoring stress
and recovery: new insights for the prevention of injuries and illnesses in elite youth soccer players.
British Journal of Sports Medicine, 44(11):809–815, 2010.

[5] D. Carey, K.-L. Ong, R. Whiteley, K. Crossley, J. Crow, and M. Morris. Predictive modelling of
training loads and injury in australian football. International Journal of Computer Science in Sport,
17, 06 2017.

[6] J. Claudino, D. Capanema, T. Souza, J. Serrão, A. Machado Pereira, and G. Nassis. Current
approaches to the use of artificial intelligence for injury risk assessment and performance prediction
in team sports: a systematic review. Sports Medicine - Open, 5(1):1–12, 2019.

[7] T. Eckard, D. Padua, D. Hearn, B. Pexa, and B. Frank. The relationship between training load and
injury in athletes: A systematic review. Sports Medicine, 48:1–33, 06 2018.

[8] F. Ehrmann, C. Duncan, D. Sindhusake, W. Franzsen, and D. Greene. Gps and injury prevention in
professional soccer. Journal of strength and conditioning research / National Strength Conditioning
Association, 30, 07 2015.

[9] J. Ekstrand, M. Hägglund, and M. Waldén. Injury incidence and injury patterns in professional
football: the uefa injury study. British Journal of Sports Medicine, 45(7):553–558, 2011.

[10] C. G. Weng and J. Poon. A new evaluation measure for imbalanced datasets. volume 87, pages
27–32, 01 2008.

[11] T. Gabbett. Reductions in pre-season training loads reduce training injury rates in rugby league
players. British journal of sports medicine, 38:743–9, 12 2004.

[12] T. Gabbett. The training-injury prevention paradox: Should athletes be training smarter and harder?
British journal of sports medicine, 50, 01 2016.

[13] T. Gabbett and S. Ullah. Relationship between running loads and soft-tissue injury in elite team sport
athletes. Journal of strength and conditioning research / National Strength Conditioning Association,
26:953–60, 02 2012.

[14] M. Hägglund, M. Waldén, R. Bahr, and J. Ekstrand. Methods for epidemiological study of injuries
to professional football players: developing the uefa model. British Journal of Sports Medicine,
39(6):340–346, 2005.

[15] M. Hägglund, M. Waldén, H. Magnusson, K. Kristenson, H. Bengtsson, and J. Ekstrand. Injuries


affect team performance negatively in professional football: an 11-year follow-up of the uefa
champions league injury study. British Journal of Sports Medicine, 47(12):738–742, 2013.

[16] S. Halson. Monitoring training load to understand fatigue in athletes. Sports medicine (Auckland,
N.Z.), 44, 09 2014.

[17] J. Han, M. Kamber, and J. Pei. 1 - introduction. In J. Han, M. Kamber, and J. Pei, editors, Data
Mining (Third Edition), The Morgan Kaufmann Series in Data Management Systems, pages 1 – 38.
Morgan Kaufmann, Boston, third edition edition, 2012.

[18] H. He, Y. Bai, E. A. Garcia, and S. Li. Adasyn: Adaptive synthetic sampling approach for imbalanced
learning. In IJCNN, pages 1322–1328. IEEE, 2008.

[19] H. He and E. Garcia. Learning from imbalanced data. Knowledge and Data Engineering, IEEE
Transactions on, 21:1263 – 1284, 10 2009.

[20] B. Hulin, T. Gabbett, D. Lawson, P. Caputi, and J. Sampson. The acute: Chronic workload ratio
predicts injury: High chronic workload may decrease injury risk in elite rugby league players. British
Journal of Sports Medicine, 0:1–7, 10 2015.

[21] M. Hägglund, M. Waldén, and J. Ekstrand. Previous injury as a risk factor for injury in elite football:
A prospective study over two consecutive seasons. British journal of sports medicine, 40:767–72, 09
2006.

[22] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: With
Applications in R. Springer Publishing Company, Incorporated, 2014.

[23] A. Jaspers, T. Beéck, M. Brink, W. Frencken, F. Staes, J. Davis, and W. Helsen. Relationships
between the external and internal training load in professional soccer: What can we learn from
machine learning? International Journal of Sports Physiology and Performance, 13:1–18, 12 2017.

[24] M. Jordan and T. Mitchell. Machine learning: Trends, perspectives, and prospects. Science (New
York, N.Y.), 349:255–60, 07 2015.

[25] C. Julien. What is acceleration load?, 2019.

[26] C. Julien. What is playerload?, 2019.

[27] R. Kimball. Latest thinking on time dimension tables, 2004.

[28] D. Kirkendall and J. Dvorak. Effective injury prevention in soccer. The Physician and sportsmedicine,
38:147–57, 04 2010.

[29] M. Kuhn and K. Johnson. Feature Engineering and Selection: A Practical Approach for Predictive
Models. Chapman and Hall/CRC, 2019.

[30] E. Lehmann and G. Schulze. What does it take to be a star? – the role of performance and the media
for german soccer players. Applied Economics Quarterly (formerly: Konjunkturpolitik), 54:59–70,
02 2008.

[31] J. M. Lucas and M. S. Saccucci. Exponentially weighted moving average control schemes: Properties
and enhancements. Technometrics, 32(1):1–12, 1990.

[32] S. Malone, A. Owen, M. Newton, B. Mendes, T. Gabbett, and K. Collins. The acute:chonic workload
ratio in relation to injury risk in professional soccer. Journal of Science and Medicine in Sport, 20,
11 2016.

[33] G. Menardi and N. Torelli. Training and assessing classification rules with unbalanced data. Data
Mining and Knowledge Discovery, 01 2012.

[34] D. Murphy, D. Connolly, and B. Beynnon. Risk factors for lower extremity injury: A review of the
literature. British journal of sports medicine, 37:13–29, 03 2003.

[35] N. B. Murray, T. J. Gabbett, A. D. Townshend, and P. Blanch. Calculating acute:chronic workload


ratios using exponentially weighted moving averages provides a more sensitive indicator of injury
likelihood than rolling averages. British Journal of Sports Medicine, 51(9):749–754, 2017.

[36] D. Pfirrmann, M. Herbst, P. Ingelfinger, P. Simon, and S. Tug. Analysis of injury incidences in male
professional adult and elite youth soccer players: A systematic review. Journal of Athletic Training,
51(5):410–424, 2016. PMID: 27244125.

[37] S. W. Roberts. Control chart tests based on geometric moving averages. Technometrics, 1(3):239–250,
1959.

[38] N. Rommers, R. Rössler, E. Verhagen, F. Vandecasteele, S. Verstockt, R. Vaeyens, M. Lenoir, and


E. D’Hondt. A machine learning approach to assess injury risk in elite youth football players.
Medicine Science in Sports Exercise, page 1, 02 2020.

[39] A. Rossi, L. Pappalardo, P. Cintia, F. M. Iaia, J. Fernãndez, and D. Medina. Effective injury
forecasting in soccer with gps training data and machine learning. PLOS ONE, 13(7):1–15, 07 2018.

[40] J. Ruddy, A. Shield, N. Maniar, M. Williams, S. Duhig, R. Timmins, J. Hickey, M. Bourne, and
D. Opar. Predictive modeling of hamstring strain injuries in elite australian footballers. Medicine
Science in Sports Exercise, 50:1, 12 2017.

[41] T. Scott, C. R. Black, J. Quinn, and A. Coutts. Validity and reliability of the session-rpe method for
quantifying training in australian football: A comparison of the cr10 and cr100 scales. Journal of
strength and conditioning research / National Strength Conditioning Association, 27, 03 2012.

[42] T. Soligard, M. Schwellnus, J.-M. Alonso, R. Bahr, B. Clarsen, H. P. Dijkstra, T. Gabbett, M. Gleeson,
M. Hägglund, M. R. Hutchinson, C. Janse van Rensburg, K. M. Khan, R. Meeusen, J. W. Orchard,
B. M. Pluim, M. Raftery, R. Budgett, and L. Engebretsen. How much is too much? (part 1)
international olympic committee consensus statement on load in sport and risk of injury. British
Journal of Sports Medicine, 50(17):1030–1041, 2016.

[43] M. Stein, H. Janetzko, D. Seebacher, A. Jäger, M. Nagel, J. Hölsch, S. Kosub, T. Schreck, D. A. Keim,
and M. Grossniklaus. How to make sense of team sport data: From acquisition to data modeling and
research aspects. Data, 2(1), 2017.

[44] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Education, 2006.

[45] H. Thornton, J. Delaney, G. Duthie, and B. Dascombe. Importance of various training load measures
on injury incidence of professional rugby league athletes. International Journal of Sports Physiology
and Performance, 12, 10 2016.

[46] A. Vaisman and E. Zimányi. Data Warehouse Systems: Design and Implementation. Springer,
Heidelberg, 2014.

[47] P. Vassiliadis and A. Simitsis. Extraction, transformation, and loading, 2009.

[48] L. Wallace, K. Slattery, and A. Coutts. The ecological validity and application of the session-rpe
method for quantifying training loads in swimming. Journal of strength and conditioning research /
National Strength Conditioning Association, 23:33–8, 11 2008.

[49] S. Williams, S. West, M. J. Cross, and K. A. Stokes. Better way to determine the acute:chronic
workload ratio? British Journal of Sports Medicine, 51(3):209–210, 2017.

[50] R. Wirth and J. Hipp. Crisp-dm: Towards a standard process model for data mining. Proceedings of
the 4th International Conference on the Practical Applications of Knowledge Discovery and Data
Mining, 01 2000.

[51] G. Wu and E. Chang. Class-boundary alignment for imbalanced dataset learning. ICML 2003
Workshop on Learning from Imbalanced Data Sets, 01 2003.

Appendices

Appendix A

Source Code

The source code for this thesis can be found at: https://github.uio.no/garthft/masters-thesis.
This includes code for the creation of the data warehouse used in this thesis, code for the ETL processes
required to load data into the warehouse, and code for the data mining study conducted in this thesis.

The data warehouse is physically implemented using PostgreSQL 12.2. ETL processes are carried out
using Java 13.0.2 and Python 3.7.6. Java is primarily used for cleaning tasks which involve parsing
CSV files. Loading tasks, however, are executed using Python together with the PostgreSQL database
adapter Psycopg2.
Data modelling is carried out using Scikit-learn, a machine learning library for Python. Additionally, data
visualisation, data processing, and reporting make use of the following Python libraries: Pandas, NumPy,
Imblearn, Matplotlib, and Seaborn.

Appendix B

MultiDim Model for Data Warehouses

Figure B.1: Dimension level

Figure B.2: Fact table

Figure B.3: Cardinalities

Figure B.4: Dimension types

Figure B.5: Balanced hierarchy

Figure B.6: Ragged hierarchy

Appendix C

BPMN Notation

Figure C.1: General notation

Figure C.2: More general notation

Figure C.3: Events

Figure C.4: Gateways

