Soccer Analytics
by
Lucas Yifan Wu
M.Sc., Simon Fraser University, 2018
B.Sc., Simon Fraser University, 2017
in the
Department of Statistics and Actuarial Science
Faculty of Science
Copyright in this work is held by the author. Please ensure that any reproduction
or re-use is done in accordance with the relevant national copyright legislation.
Declaration of Committee
Timothy Swartz
Supervisor
Professor, Statistics and Actuarial Science
Boxin Tang
Committee Member
Professor, Statistics and Actuarial Science
Oliver Schulte
Examiner
Professor, Computing Science
Ian McHale
External Examiner
Professor, Management School
University of Liverpool
Abstract
This thesis consists of a compilation of four projects all related to soccer. The first short
chapter investigates how to obtain reliable speed measurements from player tracking data.
The second chapter considers the problem of crossing the ball in soccer. In recent years,
some research suggests that there exists a negative correlation between crossing and scoring.
However, correlation does not imply causation. There are various factors that affect the
decision to cross. In the crossing problem, an experimenter cannot assign whether a
player crosses or does not cross the ball during a particular crossing opportunity because
matches are observational studies. For this reason, we use a causal inference
framework to investigate the causal effect of crossing on shots. Our findings suggest
that crossing remains an effective tactic for increasing shot probabilities.
The third chapter considers the evaluation of off-the-ball actions in soccer. There are numerous
statistics and metrics that have been proposed to evaluate the performance of players
in team sports based on actions involving the ball. In soccer, players typically do not have
possession of the ball for even three minutes during a game. In this paper, we develop
methods that analyze the activities of players that are “off-the-ball”. Then a defensive
anticipation metric is developed based on the tenet that moving faster to the expected location
is better than moving slower.
The last chapter considers the problem of pitch control in soccer. With the availability
of tracking data, one of the most intriguing ideas in soccer is to model how much space
a player or team owns at any given time, which is known as pitch control or field
ownership in the soccer analytics community. This project first conducts a literature review
on various approaches for the determination of pitch control and then introduces a new field
ownership metric that takes into account associated movement dynamics, such as speed,
acceleration and change of direction.
Keywords: Sports Analytics; Player Tracking Data; Causal Inference; Machine Learning;
Pitch Control.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my senior supervisor
Dr. Tim Swartz as I am deeply indebted to his continual support and guidance. This thesis
would not have been possible without him. He saw the potential in me, drafted me as his
PhD student and encouraged me to pursue a career in Sports Analytics.
I am extremely grateful to my examining committee, Dr. Boxin Tang, Dr. Oliver Schulte
and Dr. Ian McHale, for their thorough reading of and valuable comments on my thesis.
Special thanks to Dr. Liangliang Wang for chairing my defence.
I would also like to thank my All-Star teammates Dani Chu, Matthew Reyers, James
Thomson and Meyappan Subbaiah. Without these amazing teammates, it would have been
impossible to win the Big Data Bowl. Many thanks to the former and current SFU Sports
Analytics members who have helped to make SFU a Sports Analytics hub: Dr. Dave Clarke, Dr. Peter
Chow-White, Dr. Tim Swartz, Dr. Thomas Loughin, Dr. Luke Bornn, Dr. Oliver Schulte,
Dr. Peter Tingling, Dr. Aaron Danielson, Dr. Harsha Perera, Dr. Jacob Mortensen, Dr.
Nate Sandholtz, Sarah Bailey, Matthew Van Bommel, Steven Wu, Peter Tea, Kevin Floyd,
Robert Nguyen, Denis Beausoleil, Daniel Daly Grafstein, Chris Li, Ken Peng, Nirodha
Espasinghege Dona, Aaron Pearson, Robyn Ritchie, Ryker Moreau, Elijah Cavan, Brendan
Kumagi, James Thomson, Dani Chu and Matthew Reyers.
I am grateful to all the faculty members in the Department of Statistics and Actuarial
Science who oversaw a kid hanging around for years, especially Dr. Dave Campbell for
sparking my interest in machine learning. In addition, I would like to thank all my lovely
friends and fellow MSc and PhD students for all the tears, laughter, fears and hopes we
shared.
I would like to extend my sincere thanks to Dr. Doug Fearing, Dr. Luke Bornn and all
of my co-workers at Zelus Analytics for their support and help throughout the pandemic.
Special shout-out to COVID-19, which made everyone’s life much more difficult, but we
have grown stronger together.
Last but not least, I would like to thank my girlfriend and family, especially my parents,
for their unconditional love and support.
Table of Contents
Declaration of Committee
Abstract
Acknowledgements
Table of Contents
1 Introduction
1.1 Introduction
1.2 Organization of the thesis
3.5 Causal Inference
3.5.1 Propensity Score Matching
3.5.2 Results
3.6 Discussion
5 Pitch Control
5.1 Introduction
5.2 Literature Review
5.3 A New Metric for Pitch Control
5.3.1 Criteria for Pitch Control
5.3.2 Timing of the Ball
5.3.3 Timing of Players
5.4 An Example
5.4.1 Computation
5.5 Accuracy of the Metric
5.6 Discussion
Bibliography
List of Tables
Table 3.1 A subset of situational variables relevant to crossing which form the
columns of the design matrix Z. All distances are measured in metres.
Table 3.2 Estimates and standard errors for the parameters corresponding to
model (3.1). The third column provides the estimate multiplied by the
mean value of its corresponding covariate. The fourth column, marginal
effect, is the product of the estimate and the standard deviation of the
corresponding z terms.
Table 3.3 The key situational variables that are relevant to crossing success as
modeled in Section 3.4. All distances are measured in metres, speed is
measured in metres/second, angles are measured in degrees, and areas
are measured in squared metres.
Table 3.4 Estimates of the parameters from the intended target model and other
related statistics. The estimates describe associations between
spatio-temporal features and the successful completion of an attempted cross.
Table 4.1 The defensive anticipation metric P calculated during even and odd
weeks for players on Shandong Luneng during the 2019 season.
Table 4.2 The defensive anticipation metric P given by (4.2) for 10 players on
Shandong Luneng who received the most playing time during the 2019
CSL season. We also provide comparison metrics involving aggression
during the 2019 season, namely the total number of fouls committed,
tackles made and the number of interceptions.
Table 5.1 The determination of pitch control at a given location given time
inequalities involving t_b, t_h and t_r.
Table 5.2 The classification of 7901 intended passes according to whether pitch
control (PC) was designated to the intended team, the opponent or
neither team.
List of Figures
Figure 2.1 Path of a player over a 29-second interval based on location data
recorded at 10 hertz.
Figure 2.2 Estimated speed (∆ = 1) of the player corresponding to the path in
Figure 2.1 over a 29-second interval.
Figure 2.3 Estimated speed (∆ = 4) of the player corresponding to the path in
Figure 2.1 over a 29-second interval.
Figure 2.4 The red-lined plots correspond to speed and acceleration estimates
(∆ = 1) for Brandin Cooks of the NFL during a 7-second time
interval. The analogous blue-lined plots correspond to ∆ = 2.
Figure 3.1 Examples of possession sequences (a) with a crossing attempt and
(b) without a crossing attempt.
Figure 3.2 Panels (a) and (b) present output from the intended target model.
These diagrams provide a way for teams to study the spatial
configurations of players and the ball during crossing opportunities.
Figure 3.3 The directed acyclic graph describes the crossing problem. The
variables Z_T are causes of T, but not Y. The variables Z_TY are common
causes for T and Y. And, the variables Z_Y are causes for Y, but not T.
Figure 3.4 After matching, histograms of the two groups (treatment and control)
are depicted where the horizontal variable is the propensity score.
Figure 3.5 After matching, smoothed plots of the shot variable Y for both
groups with respect to the propensity score.
Figure 4.1 Correlation of predicted speed at time t and actual speed at time t−∆
where time is measured in seconds. The blue dashed line corresponds
to the selected value ∆ = 0.5 seconds.
Figure 4.2 Geometric diagram which illustrates the components of the statistic
p in equation (4.1). Imagine a player who is located at the origin
(0, 0). The observed velocity of the player is shown by the blue vector
pointing towards (2, 4). The predicted velocity of an average player
is shown by the yellow vector pointing towards (8, 4). The perpendicular
line indicates the projection of the observed velocity vector
on the predicted velocity vector. Using equation (4.1), the defensive
anticipation value, p, is equal to −0.6, which can be interpreted as a
60% reduction compared to the average player.
Figure 4.3 Plot of predicted velocities (purple arrows) and observed velocities
(black arrows) at a given instant in time. The blue team is in possession,
the yellow team is defending and the red dot corresponds to
the ball.
Figure 4.4 Density plots of (4.2) based on playing position. For each player,
the defensive anticipation metric (4.2) was calculated for all matches
in the 2019 CSL season. We observe that central midfielders have
slightly larger defensive anticipation values than other players on
average, and there is more variability amongst the forwards than the
other playing positions.
Figure 4.5 Scatterplots of the defensive anticipation metric (4.2) plotted against
player interceptions and tackles made during the 2019 CSL season.
Figure 4.6 Plot of the defensive anticipation metric (4.2) averaged over all CSL
players during 10-minute intervals.
Figure 5.1 Voronoi diagram based on n = 5 points generated on the unit square.
Figure 5.2 Voronoi diagram applied to a given snapshot of a soccer game based
on the location of the 22 players on the pitch. The shaded orange
and purple areas correspond to the dominant regions for the home and
away teams, respectively.
Figure 5.3 The distribution of maximum speed and maximum acceleration of
all players in the Chinese Super League in 2019.
Figure 5.4 Current velocity vectors for the example depicted in Figure 5.2.
Figure 5.5 The left plot uses colors to depict the time that it takes a stationary
player to reach field locations given the current location marked with
a dot. The right plot does likewise but introduces an initial velocity
(arrow) for the player.
Figure 5.6 Pitch control diagram using the proposed methods for the example
depicted in Figure 5.2.
Chapter 1
Introduction
1.1 Introduction
Sports analytics is an emerging field that combines sports with multidisciplinary knowledge
and expertise, such as statistics, computing science, sports science and business, to
support decisions in player evaluation, injury prevention, business operations, etc. The book
and movie Moneyball were among the first influences that put sports analytics in front of
the eyes of the public. The Moneyball movement started in baseball and swept across
multiple sports within a few years.
Humans are often clouded by personal judgement when making decisions. This was also
featured in the movie Moneyball: "People are overlooked for a variety of biased reasons and
perceived flaws - age, appearance, personality." One common bias is recency bias,
where we tend to weight the most recent event more heavily than we should. For
example, when a player has just made a poster dunk, we are more likely to remember that
highlight and downweight the fact that he gave away five easy layups to his opponents earlier.
In the end, we might only remember one moment of brilliance and come away with the
perception that the player had an amazing game. It is easy to find many similar
examples in sports of how this type of bias can hinder player evaluation.
Baseball was one of the earliest sports to embrace the idea of using numbers to inform
decisions. Moving beyond relying on pure instinct to evaluate players was a huge leap for
sports analytics. In the early days, the only available data were box score statistics,
which are summary statistics in a few categories. As teams recognized the importance of
more granular data, they began to collect event data, which provide finer details on
the sequence of events and the players involved in each recorded event.
In my opinion, this was the first wave of evolution in sports
analytics, where there was a shift of mindset towards adopting numbers to analyze player
performance objectively. The second wave of evolution came with the accessibility of tracking data.
Although event data provide a rich amount of contextual information, event data do not
describe what other players are doing when they do not possess the ball or are not involved
in the recorded event. Tracking data fill the gap by collecting detailed information, such as
the (x, y) coordinates of the ball and all players on the field, multiple times per second. The
availability of spatio-temporal tracking data unlocks a new world for researchers to
explore and allows them to tackle questions that they were previously unable to answer.
Plenty of interesting research has been done using tracking data in baseball, basketball,
soccer and football since then.
• Wu, L. and Swartz, T.B. (2022). The calculation of player speed from tracking data.
International Journal of Sports Science & Coaching, 0(0).
Chapter 3 considers the problem of crossing the ball in soccer. In recent years, some
research has suggested that there exists a negative correlation between crossing and scoring.
However, correlation does not imply causation. There are various factors that affect the
decision to cross, including the position of the cross, the defensive pressure on the crosser,
the distance between the crosser and his teammates, the score differential, the number of
defenders in the box, etc. In general, randomized controlled trials are the gold standard
approach to estimate the causal effects of a treatment on an outcome. In the crossing problem,
an experimenter cannot assign whether a player crosses or does not cross the ball during
a particular crossing opportunity because matches are observational studies.
For this reason, we use a well-established method from the causal inference framework,
propensity score matching, to investigate the causal effect of crossing on shots. This is
one of the few papers that considers a causal inference approach in team sport, and it utilizes
player tracking data to identify and measure confounding variables. Our findings suggest
that crossing remains an effective tactic for increasing shot probabilities. This chapter has
been published as the following research article:
• Wu, L., Danielson, A., Hu, J.X. and Swartz, T.B. (2021). A contextual analysis of
crossing the ball in soccer. Journal of Quantitative Analysis in Sports, 17(1), 57-66.
Chapter 4 considers the evaluation of off-the-ball actions in soccer. There are numerous
statistics and metrics that have been proposed to evaluate the performance of players in
team sports based on actions involving the ball. In soccer, players typically do not have
possession of the ball for even three minutes during a game. In this paper, we develop
methods that analyze the activities of players that are “off-the-ball”. Specifically, we propose
a metric to measure defensive anticipation in soccer. An analogy can be drawn with chess: when
you are planning your next move, you always try to anticipate the moves of your
opponent. Similarly in soccer, we try to conceptualize the idea of anticipation for defensive
players using expected movements at the next moment given a snapshot of the game. The
expected movement at the next moment is a function of the spatio-temporal snapshot of the
match prior to that moment in time. This provides a new way to evaluate the performance
of players off-the-ball. We use machine learning models to learn the non-linear relationship
between the contextual variables and velocity from a massive set of game instances. The
output from the model, which we term the predicted (expected) velocity, represents where
the player is expected to move and how fast he is expected to move on average. Then a
metric is developed by comparing the player’s actual velocity with the predicted velocity
of a typical player in this situation. The interpretation of the defensive anticipation metric
is based on the tenet that moving faster to the expected location is better than moving
slower. This chapter is under revision at Statistica Applicata - Italian Journal of Applied
Statistics:
• Wu, L. and Swartz, T.B. (2022). Evaluation of off-the-ball actions in soccer. Manuscript
under review.
Chapter 5 considers the problem of pitch control in soccer. With the availability of tracking
data, one of the most intriguing ideas in soccer is to model how much space a player or team
owns at any given time, which is known as pitch control or field ownership in the soccer
analytics community. This chapter first reviews various approaches for the determination
of pitch control and then introduces a new metric that takes into account the associated
movement dynamics of the ball and players. With the pitch control model, we can determine
whether the home team, the road team or neither team has control at any given location on the field.
This approach is generally applicable to invasion sports and is illustrated in the context of
soccer. This chapter has been submitted to Scientific Reports:
• Wu, L. and Swartz, T.B. (2022). A New Metric for Pitch Control based on an Intuitive
Motion Model. Manuscript under review.
Chapter 2
2.1 Introduction
In the past decade, the advent of player tracking data has sparked a revolution in sports
analytics (Morgulev, Azar and Lidor 2018). With player tracking data, analysts have access
to the Cartesian coordinates of each player on the pitch where the observations are recorded
frequently (e.g. 10 times per second). The availability of such detailed data provides
opportunities to investigate sporting questions that were previously unimaginable. Gudmundsson
and Horton (2017) provide a review paper on spatio-temporal analyses used in invasion
sports where player tracking data are available.
Currently, player tracking systems are expensive, and consequently, tracking data are
only collected in “big” sports such as basketball (the National Basketball Association),
soccer (various leagues and competitions), football (the National Football League) and
hockey (the National Hockey League). Tracking data are not only collected during matches
but also during workout sessions where fitness, training and health considerations are main
concerns.
Tracking data are typically proprietary and are supplied by service providers using
various technologies (Torres-Ronda et al. 2022). There are four prominent technologies: (1)
global positioning systems (GPS), (2) local positioning systems (LPS), (3) inertial
measurement units (IMU) and (4) optical tracking (OT) systems. OT systems are fundamentally
different as they do not require wearable devices and do not directly determine player
coordinates. Instead, OT technology requires advanced camera systems and player recognition
software to evaluate player coordinates. No matter which technology is utilized, tracking
systems begin with the collection of the (x, y) coordinates of participants measured at frequent
time intervals. With the coordinates, various statistics can be calculated or approximated
(e.g. speed, acceleration, distance travelled, etc.).
In this paper, we are concerned with derivative calculations associated with tracking
data coordinates. Specifically, we are interested in the approximation of player speed which
is an important statistic in sports analytics and sports science. For example, Wu and Swartz
(2022) require player speeds in soccer to assess off-the-ball activity. They introduce a
measure which addresses defensive anticipation. Buchheit et al. (2014) use regression
methodology to determine factors that are associated with player speed in soccer. For example,
horizontal force and horizontal power were seen to be associated with speed. Oliva-Lozano
et al. (2020) characterize positional differences in soccer based on acceleration and sprint
profiles. Related to speed, Shen, Santo and Akande (2022) analyze pace of play in soccer, and
conclude that pace increases with decreasing team quality, which indicates the importance
of playing with pace. From a training and performance perspective, Ferrari Bravo et al.
(2008) demonstrate that sprint-training significantly increases both aerobic and anaerobic
performances in soccer. Naturally, different applications require different levels of accuracy.
For example, in sports science, critical velocity is an active research field which relies on
highly accurate measurements of speed (Peng, Clarke and Swartz 2022).
Much has been written on the accuracy of various tracking data technologies. For
example, Mara et al. (2017) considered the displacement accuracy of an OT system, Tan,
Polglaze and Peeling (2021) investigated the validity and accuracy of a GPS system, and
Pino-Ortega et al. (2022) provided a review of the validity and reliability of LPS systems
against other devices. Massard, Eggars and Lovell (2017) questioned the need for sprint
testing based on the comparison of GPS match and field-testing data. However, all of these
investigations rely on some measure of the truth against which tracking measurements are
compared. What should experimenters do if they do not have access to the truth and they
are unsure of the accuracy of speed calculations obtained from tracking data? This paper
introduces some simple principles from exploratory data analysis that assist experimenters
in obtaining more reliable estimates of speed.
In Section 2.2, we describe the datasets upon which our methods are illustrated, and we
describe how player speed is calculated from tracking data coordinates. In Section 2.3, some
simple exploratory plots are introduced that help the analyst obtain more reliable speed
calculations. We conclude with a short discussion in Section 2.4.
2.2 Data
We have access to tracking data from matches during the 2019 season of the Chinese Super
League (CSL). The CSL uses OT technology (previously discussed) provided by Stats
Perform where observations were recorded 10 times per second. The tracking data consist
of roughly one million rows per match measured on 7 variables. Each row corresponds to a
particular player at a given instant in time. The soccer tracking data were initially provided
as xml files, and were processed in R for further analysis. In Table 2.1, we present three
rows of the soccer tracking data. Here we observe x-y coordinates and player identifiers
at every 1/10th of a second. The entries are mostly intuitive except perhaps for the x-y
coordinates which refer to the player location on a 105m by 68m soccer field. For example,
(x, y) = (−52.5, 0) corresponds to the middle of the goal line on the left hand side of the
soccer field.
Our second dataset corresponds to tracking data from the National Football League
(NFL). Unlike the OT soccer data, the NFL data were based on GPS technology, but were
also collected using 10 hertz sampling frames. The data were used in the 2019 Big Data
Bowl competition and are publicly available at https://github.com/nfl-football-ops/Big-Data-Bowl.
Here we use data corresponding to a single deep pass play by the wide receiver
Brandin Cooks of the New England Patriots taken from a 7-second interval during the
September 7, 2017 match against the Kansas City Chiefs. In Table 2.2, we present three
rows of the football tracking data. Here we observe a similar structure to the tracking data
in soccer. The football tracking data include the x-y coordinates for players measured in
yards where x refers to the player position along the long axis of the field ranging from 0 to
120 yards, and y refers to the player position along the short axis of the field ranging from
0 to 53.3 yards. For instance, (x, y) = (0, 0) corresponds to the bottom left of the football
field. The remaining variables in Table 2.2 are mostly intuitive where dis corresponds to
distance travelled from the previous frame (i.e. previous 1/10th second) and dir corresponds
to the angle of player motion in degrees. The frame.id is the frame identifier for each frame
which resets to 1 for each play.
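Because the football data are recorded in yards while the soccer data are recorded in metres, it can help to place both on a common scale before comparing speeds. The following is a minimal sketch under the column meanings described above; the function names are hypothetical and the snippet is an illustration rather than code taken from the thesis.

```python
YARD_IN_METRES = 0.9144  # exact, by definition of the international yard


def yards_to_metres(value_in_yards):
    """Convert a length from yards to metres."""
    return value_in_yards * YARD_IN_METRES


def speed_from_dis(dis_yards, hz=10.0):
    """Approximate speed in metres/second from `dis`, the distance in yards
    travelled since the previous frame, assuming `hz` frames per second."""
    return yards_to_metres(dis_yards) * hz


# A player who covers 1 yard between consecutive frames is moving at
# roughly 9.1 metres/second.
example_speed = speed_from_dis(1.0)
field_length_m = yards_to_metres(120)  # long axis of the field, in metres
```

Note that a speed derived from a single value of `dis` is a one-frame difference, so it is subject to the same measurement noise as any other short-window estimate.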
Consider then a particular player where our interest concerns the calculation of their
speed. If (x(t), y(t)) denotes the location of the player at time t, then the player’s speed at
time t is defined by
In words, formula (2.1) is the limiting change in distance travelled with respect to time.
Of course, (2.1) is a mathematical expression based on taking a limit, and is not a quantity
that can be calculated from data. Instead, with tracking data, the player’s locations are
obtained at regular times which are denoted by (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ). Here, the
subscripts i = 1, . . . , n of the Cartesian coordinates refer to the time increments. Therefore,
assuming that t corresponds to an observed time increment from the tracking data, it is
reasonable to approximate s(t) in (2.1) by
Then relative error RE is given by
We note that the relative error (2.3) is smaller for larger speeds (i.e. greater changes in
location ∆l ). For example, when ∆ = 1, consider a true location displacement ∆l = 8
metres which is incorrectly measured as 9 metres. Then the actual speed is 8.0 metres/sec
(fast), the observed speed is 9.0 metres/sec, and the measurement error is E = 1 metre.
This results in relative error RE = 0.125. For contrast, when ∆ = 1, consider a true location
displacement ∆l = 2 metres which is incorrectly measured as 3 metres. Then the actual
speed is 2.0 metres/sec (slow), the observed speed is 3.0 metres/sec, and the measurement
error is E = 1 metre. This results in relative error RE = 0.50.
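The two worked examples can be checked numerically. This is a minimal sketch which assumes that the relative error (2.3) is the measurement error divided by the true displacement, RE = E/∆l, the form implied by the numbers quoted above; the function name is hypothetical.

```python
def relative_error(true_displacement, measured_displacement):
    """Relative error RE = E / delta_l, with E the absolute measurement error."""
    error = abs(measured_displacement - true_displacement)
    return error / true_displacement


fast = relative_error(8.0, 9.0)  # fast player: 8 metres measured as 9 metres
slow = relative_error(2.0, 3.0)  # slow player: 2 metres measured as 3 metres
print(fast, slow)  # 0.125 0.5
```

The same one-metre error is four times as damaging, in relative terms, for the slower player.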
Figure 2.1: Path of a player over a 29-second interval based on location data recorded at 10
hertz. (The starting point of the path is marked on the plot.)
However, when we take the path locations in Figure 2.1, and estimate speeds (2.2) using
∆ = 1, there seems to be a significant accuracy problem. Figure 2.2 provides a plot of
estimated speed versus time for the selected path. In Figure 2.2, we observe that there
are many instances where a player has a recorded speed which increases (or decreases) by
roughly 1.0 metre per second in the subsequent 1/10th second, and then returns to the
baseline speed 1/10th of a second later. When speeds are recorded in the (0, 8) metres per
second range, frequent fluctuations of this magnitude do not seem plausible. The problem
here is that the location measurements were recorded to one decimal place on the metres
scale. A rounding error of up to 0.05 metres in each location is magnified when the
displacement in (2.2) is divided by 2∆, which corresponds to only
0.2 seconds.
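The effect of coordinate rounding can be reproduced on synthetic data. The sketch below assumes that the estimate (2.2) is a central difference over 2∆ frames, as the division by 2∆ suggests, and simulates a player running in a straight line at a constant 3.07 metres/sec with locations rounded to one decimal place. The function names and the simulated path are illustrative assumptions and are not the CSL data.

```python
import math


def central_speed(xs, ys, delta, hz=10.0):
    """Central-difference speed estimates (metres/second) at interior frames.

    Assumed form: distance between the locations `delta` frames after and
    `delta` frames before each frame, divided by the elapsed time 2*delta/hz.
    """
    window_seconds = 2 * delta / hz
    speeds = []
    for i in range(delta, len(xs) - delta):
        dx = xs[i + delta] - xs[i - delta]
        dy = ys[i + delta] - ys[i - delta]
        speeds.append(math.hypot(dx, dy) / window_seconds)
    return speeds


TRUE_SPEED = 3.07  # metres/second, constant by construction
N_FRAMES = 100     # 10 seconds of data at 10 hertz

# Straight-line path along x, rounded to one decimal place on the metres scale.
xs = [round(TRUE_SPEED * i / 10.0, 1) for i in range(N_FRAMES)]
ys = [0.0] * N_FRAMES


def worst_error(delta):
    """Largest absolute deviation of the estimates from the true speed."""
    return max(abs(s - TRUE_SPEED) for s in central_speed(xs, ys, delta))


err_delta_1 = worst_error(1)  # rounding alone allows errors up to 0.5 m/s
err_delta_4 = worst_error(4)  # the wider window caps the error at 0.125 m/s
print(err_delta_1, err_delta_4)
```

Although the simulated player never changes speed, the ∆ = 1 estimates jump between neighbouring values, whereas the ∆ = 4 estimates stay much closer to the true speed.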
Figure 2.2: Estimated speed (∆ = 1) of the player corresponding to the path in Figure 2.1
over a 29-second interval. (Axes: time elapsed in seconds, 0 to 28; estimated speed in m/s.)
A remedy to the estimation of the instantaneous speed s(t) is to increase the time
increment ∆ surrounding t. Increasing the length of the time interval 2∆ results in less
fluctuation in the estimated speeds which is desirable. However, this is done at the expense
of moving in the direction from instantaneous speeds to average speeds. We have found that
the choice ∆ = 4 works well in this application. Figure 2.3 provides the analogous
plot to Figure 2.2 where the time intervals have been widened to intervals of length 0.8
seconds. In Figure 2.3, we observe that the fluctuations are less pronounced, and that the
plot of estimated speed versus time is smoother. For example, the fluctuations during the
interval 16-18 seconds in Figure 2.2 are less believable than what is observed in Figure 2.3.
We refer back to the theoretical analysis of relative error at the beginning of Section
2.3. In this example, we have seen that we prefer the time increment ∆ = 4 over ∆ = 1.
With ∆ = 4, speed ∆l /2∆ = 8 metres/sec and location measurement error E = 1 metre,
this implies ∆l = 64 metres and relative error RE = E/64 = 0.015625. With ∆ = 1, speed
∆l /2∆ = 8 metres/sec and location measurement error E = 1 metre, this implies ∆l = 16
metres and relative error RE = E/16 = 0.0625. Therefore, ∆ = 4 is preferred over ∆ = 1
in reducing relative error. This exercise can be repeated for any speed.
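The comparison can be tabulated for other window sizes. This is a minimal sketch that follows the convention of the paragraph above, in which the displacement is ∆l = 2∆ × speed and the relative error is RE = E/∆l; the function name is hypothetical.

```python
def relative_error_for_window(speed, delta, error=1.0):
    """RE = E / delta_l, with delta_l = 2 * delta * speed (the text's convention)."""
    displacement = 2 * delta * speed
    return error / displacement


for delta in (1, 2, 4):
    print(delta, relative_error_for_window(8.0, delta))
# delta = 1 gives 0.0625 and delta = 4 gives 0.015625, matching the text.

re_delta_1 = relative_error_for_window(8.0, 1)
re_delta_4 = relative_error_for_window(8.0, 4)
```

Under this convention the relative error is inversely proportional to ∆, so quadrupling the window reduces the relative error by a factor of four at any fixed speed.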
Figure 2.3: Estimated speed (∆ = 4) of the player corresponding to the path in Figure 2.1
over a 29-second interval. (Horizontal axis: time elapsed in seconds, 0 to 28.)
Issues which arise in speed measurements are a consequence of the fact that speed is
the derivative of position, and that position is not measured with sufficient accuracy. In
applications where acceleration measurements are important, one can imagine even greater
challenges since acceleration is the derivative of speed. This is illustrated in the following
example.
and same (i.e. only one change in direction). With respect to the estimation of acceleration,
∆ = 2 is preferred over ∆ = 1.
Figure 2.4: The red-lined plots correspond to speed and acceleration estimates (∆ = 1) for
Brandin Cooks of the NFL during a 7-second time interval. The analogous blue-lined plots
correspond to ∆ = 2. (Two panels; horizontal axes: time elapsed in seconds, 0 to 7.)
2.4 Discussion
Tracking data have provided opportunities to study problems in sports analytics which
were once unimaginable. However, sound tracking data analyses require data that are reliable,
and the reliability of tracking data statistics often degrades with increasingly complex
statistics. We have provided some simple principles from exploratory data analysis to help
experimenters derive more reliable estimates of player speed. The same principles can be
utilized in the calculation of velocities and accelerations.
The principles developed here are general and can be used with any type of player track-
ing system in any sport. The experimenter needs to consider the estimands of interest. The
experimenter also requires domain knowledge of the sport to assess whether the resultant
variations in the estimates are reasonable.
An avenue of future research may involve the implementation of statistical methods to
smooth estimates of speed and acceleration. For example, one might consider the Hodrick-
Prescott filter to smooth estimates of speed (Hodrick and Prescott 1997).
Instead of having experimenters manually estimate speed from (x, y) coordinates, some
tracking data providers automatically provide speed statistics. Coleman (2018) describes
the procedure that the data provider Opta uses in calculating top speeds for players in
soccer: “The speed in kilometers per hour for a given frame is based on the previous 15
frame-to-frame speeds. Out of the 15 frame-to-frame speeds, the four highest and the four
lowest values are discarded and the result is an average of the remaining seven values.”
Given that speed is of great importance in sports analytics, we suggest that it would be
good practice for the providers to be explicit about the derivation and justification of
their speed calculations.
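The quoted procedure amounts to a trimmed mean over a trailing window, which can be sketched as follows. This is only a reading of the quotation from Coleman (2018), not the provider's actual code. The function name `trimmed_frame_speed` and its parameters are hypothetical, and how ties or the first 14 frames are handled is not stated in the quotation.

```python
def trimmed_frame_speed(frame_speeds, window=15, trim=4):
    """Trimmed mean of the most recent `window` frame-to-frame speeds.

    The `trim` highest and `trim` lowest values are discarded and the
    remaining values are averaged (15 - 4 - 4 = 7 values by default).
    """
    if len(frame_speeds) < window:
        raise ValueError("need at least `window` frame-to-frame speeds")
    recent = sorted(frame_speeds[-window:])
    kept = recent[trim:window - trim]
    return sum(kept) / len(kept)
```

Discarding the extremes makes the reported value insensitive to a few frames with large location errors, which is the same concern that motivates the wider time increments discussed earlier in the chapter.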
Chapter 3
3.1 Introduction
The sport of soccer (association football) has a long history dating back to 1863 when
the Laws of the Game were codified by the Football Association in England. Throughout
the history of the sport, tactics have evolved with the intention of providing a competitive
advantage (Wilson 2013). As a strategy, the action of crossing the ball in soccer has always
been a staple of the game that has been thought to produce goals. A crossed ball occurs
when a player (normally situated in a wide area of the attacking third of the pitch) kicks
the ball towards the box with the intention that an attacking teammate will score.
However, in recent years, research has been carried out that casts doubt on the benefits
of crossing the ball. Vecer (2014) provides a persuasive argument that the overall effect of
crossing the ball has a strong negative impact on scoring. Vecer (2014) uses both aggregate
crossing statistics and multilevel Poisson regression to study the impact of crossing. In the
analyses, there is a suggestion that crossing (when executed properly) is valuable; however,
the rate of bad crosses greatly exceeds the rate of good crosses, and this is a primary
argument against crossing. Vecer (2014) also demonstrates that missed scoring opportunities
due to open crossing are associated with the quality of the attacking team. In recent years,
teams have become more reluctant to cross the ball. For example, Vecer (2014) states that
the number of open crosses in the German Bundesliga dropped from 12.0 per match in
the 2009/2010 season to 8.9 per match in the 2015/2016 season, a decrease exceeding 25%.
Vecer (2014) analyzes the efficiency of crossing and finds that 14.5% of the goals scored
in the English Premier League were the result of open crosses. We found a similar story in
the Chinese Super League, where 16.9% of the goals were scored from open crosses in the
2019 season.
Sarkar (2018) investigates crosses from a game theoretic perspective. They assume the
attacking team can cross the ball or not, and the defending team can utilize an offside
trap or not. The vector of equilibrium strategies determines the probabilities of the possi-
ble outcomes. Somewhat surprisingly, Sarkar (2018) suggests that teams that are good at
aspects of executing a cross should cross the ball less often. Sarkar (2018) and Sarkar and
Chakraborty (2018) also confirm the inverse relationship between the number of crosses and
the number of goals scored in a match. Other papers that have provided nuanced views on
the negative effects of crossing include Liu et al. (2015) and Oberstone (2009).
Given the longstanding history of crossing the ball in soccer, the conclusions reached
by Vecer (2014) and Sarkar (2018) have been surprising to many, including the authors of
this paper. We hypothesize that there are contexts in which crossing the ball in soccer is a
beneficial strategy. Knowing when to cross the ball is a step in the direction of effective
playing strategy. Our contextual investigation is made possible by the availability of player
tracking data. Player tracking data in soccer consists of the (x, y) coordinates of the ball and
the 22 players on the pitch recorded at regular and frequent time intervals. Player tracking
data in sport are the catalysts for big data analyses and do not form part of the analyses by
Vecer (2014) and Sarkar (2018). Gudmundsson and Horton (2017) provide a review paper
on spatio-temporal analyses used in invasion sports (including soccer) where player tracking
data are available. The analysis of player tracking data has been particularly prominent in
the sport of basketball; see for example, Miller et al. (2014).
Although tactical decisions are a fundamental aspect of sport, sporting decisions are
not typically based on the results of randomized designs, the bread and butter of causal
inference. Clearly, in professional sport, match outcomes are important and coaches would
be unwilling to implement a tactic in a random selection of games and then implement an
alternative tactic in a remaining subset of games. There are many approaches that estimate
causal effects with observational data (see Pearl 2009), but these methods have not received
much attention in the sports analytics literature. One exception is the work of Yam and
Lopez (2019) who investigate the impact of “going for it” on fourth down in the National
Football League as opposed to punting or kicking a field goal. Their approach is based
on matching propensity scores and covariates associated with game situations. As another
example, Toumi and Lopez (2019) use propensity score matching and Bayesian additive
regression trees to estimate the causal effects of zone-entry decisions in the National Hockey
League.
Our work uses spatio-temporal data to investigate three aspects of the crossing problem
in soccer. First, we investigate the spatio-temporal conditions that lead to crossing. Then we
introduce an intended target model that investigates crossing success. Finally, a contextual
analysis is provided that assesses the benefits of crossing in various situations. The analysis
is based on causal inference techniques and suggests that crossing remains an effective tactic
in particular contexts.
Section 3.2 introduces the dataset. We outline the steps involved in converting the player
tracking data into features that are used in the ensuing analyses. The resultant design matrix
consists of rows that correspond to crossing opportunities and columns (covariates) that are
believed to be related to aspects of crossing. Our analysis is based on various assumptions used
in the definition of a crossing opportunity and on the definition of outcomes arising from
crossing opportunities. In cases where the rationale for the assumptions is less clear, we
introduce tuning parameters so that analyses can be carried out using a range of values of
the tuning parameters.
Section 3.3 is concerned with the spatio-temporal conditions that lead a player to cross
the ball. We develop a logistic regression model which relates the attempt (or non-attempt)
to cross the ball to covariates (situational variables) which are believed to be related to the
crossing decision. We observe that the model makes physical sense according to our under-
standing of soccer. The fitted model provides evidence of the rich information embedded in
the player tracking data. The logistic model is subsequently used in the causal analysis of
Section 3.5.
Section 3.4 develops an intended target model. The model introduces additional covari-
ates that are relevant to the probability of success of a cross. The analysis concerns a sender
(the player contemplating the cross) and potential receivers (players to whom the cross
may be intended). The intended target model provides insight as to whom a cross ought to
be directed. Again, the fitted model aligns with our understanding of soccer. The information
gleaned from the model may benefit players and coaches in terms of tactical decisions.
In Section 3.5, we first review concepts needed to apply basic causal inference tech-
niques to the crossing problem. Then we use propensity score matching to assess whether
crossing is beneficial. Our results are nuanced as crossing is seen to be beneficial in par-
ticular circumstances, and these circumstances are those when a player is more likely to
cross. We therefore see that the intuition of soccer players involving the decision to cross
corresponds to good decision making. And importantly, we dispel the notion that crossing
is not a valuable tactic in soccer.
Some concluding remarks are then provided in Section 3.6.
for internal consistency. In the Shandong Luneng dataset, tracking data are obtained from
the use of optical recognition software. The Shandong Luneng tracking data consists of
roughly 1,000,000 rows per match measured on 7 variables where the data are recorded every
1/10th of a second. Each row corresponds to a particular player at a given time. Although
the inferences gained via our analyses are specific to Shandong Luneng, it is plausible that
some of the broad insights may hold generally to high level soccer competitions.
Figure 3.1: Examples of possession sequences (a) with a crossing attempt and (b) without
a crossing attempt.
3.2.2 Crafting Situational Variables
Building on previous research that evaluates passing ability (Szczepanski and McHale 2016,
Power et al. 2017), we propose variables specific to the context of crossing.
It is a tenet of soccer that time and space are paramount factors that lead to improved
attacking outcomes. From the tracking data, it is possible to determine the location and
velocity of both the ball and the player of interest. The location and velocity measurements
form the basis for the situational variables presented in Table 3.1. Recall that the situational
variables τ, z1 , ..., z9 form the columns of a design matrix Z where the rows of Z are crossing
opportunities corresponding to the final event in a possession sequence occurring in potential
crossing zones. Most of the situational variables in Table 3.1 are self-explanatory. The
variable z2 (nearest defender distance) is a measure of defensive pressure on the sender.
However, it does not account for situations where multiple defenders are covering the
sender, nor for the location of the defender relative to the sender. A defender standing one
metre in front of the sender poses a very different threat from a defender standing one metre
behind. The variable z3, indicating the space controlled within 2 metres by the sender, has
been introduced using ideas from
Fernandez and Bornn (2018) and Fernandez et al. (2019). Although we experimented with
many other crossing variables, the variables presented in Table 3.1 are those that provided
an excellent fit for the logistic model of Section 3.3.
Table 3.1: A subset of situational variables relevant to crossing which form the columns of
the design matrix Z. All distances are measured in metres.
Alternative indicator variables that we have considered for a response variable are
whether a crossing opportunity led to a shot on goal Y2 and whether a crossing oppor-
tunity led to a shot Y3 . The variable Y2 is more common than Y1 and Y3 is more common
than Y2 . For this reason, we prefer the response variable Y = Y3 . We note that shot statis-
tics (as opposed to goal statistics) are prevalent in the hockey analytics literature and are
referred to as Fenwick and Corsi (Vollman, Awad and Fyffe 2016).
Clearly, shots do not necessarily occur immediately after a cross. Therefore, we introduce
a tuning parameter k where a success (shot attempt) is defined as having occurred within
the next k events. If the team maintains possession after the ball exits the potential crossing
zone and a shot attempt occurs within the next k events, then Y = 1, otherwise Y = 0. In
this application, we set k = 5. The idea to let the play “unfold” was used by Schuckers and
Curro (2013) in the context of player evaluation in hockey. Using the above definition for Y ,
we observed 274 shots arising from the N = 2225 crossing opportunities. With the choice
k = 5, it took 2.61 seconds on average for a shot to occur after a cross. Also, the offensive
team retained possession (and did not cross the ball) 14.92% of the time (332 out of the
2225 cases). We recognize that k is a tuning parameter and we have experimented with
different values for k, such as k = 4, 6, 7 and found little difference in the results. Another
possible way of defining the response variable involves the consideration of time until a shot
occurs. For example, Espasinghege Dona and Swartz (2022) define Y according to whether
a shot occurs by the end of possession.
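The definition of the response Y with tuning parameter k can be sketched as a simple labelling rule. This is an illustration under assumptions: the event records, their keys `"team"` and `"type"`, and the rule that any event by the opposing team ends possession are all hypothetical and are not taken from the dataset.

```python
def shot_within_k(events, start, team, k=5):
    """Return Y for the crossing opportunity at index `start`.

    events : chronological list of dicts with hypothetical keys
             "team" and "type"
    team   : the team in possession at the crossing opportunity
    k      : number of subsequent events to inspect (k = 5 in the text)

    Y = 1 if `team` attempts a shot within the next k events while
    keeping possession, and Y = 0 otherwise.
    """
    for event in events[start + 1:start + 1 + k]:
        if event["team"] != team:
            return 0  # possession lost before any shot (assumed rule)
        if event["type"] == "shot":
            return 1
    return 0
```

Rerunning the labelling with `k=4`, `k=6` or `k=7` corresponds to the sensitivity check described above.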
logit(pT ) = λ0 + λZ . (3.1)
Parameter estimates and standard errors for the significant terms corresponding to
model (3.1) are given in Table 3.2. To get a sense of the relative importance of the terms,
the third column in Table 3.2 provides the parameter estimate multiplied by the mean value
of its corresponding covariate. A notable observation is that given a crossing opportunity,
crossing the ball is less frequent than not crossing the ball. For example, when the mean
values of the covariates are substituted into the fitted equation corresponding to (3.1), the
probability of a cross is Prob(T = 1) = 0.130. We also note that all of the parameters in
Table 3.2 are highly significant except for z1 (p-value = 0.040) and z9 (p-value = 0.051).
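To make the link between the fitted equation and the reported probability concrete, the following sketch shows the logit and inverse-logit transformations. The function names are hypothetical and no coefficients from Table 3.2 are used. The only number taken from the text is Prob(T = 1) = 0.130 at the covariate means, which corresponds to a linear predictor of roughly −1.90 on the logit scale.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability corresponding to a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def crossing_probability(intercept, coefs, covariates):
    """Prob(T = 1) under a logistic model of the form (3.1)."""
    eta = intercept + sum(b * z for b, z in zip(coefs, covariates))
    return inv_logit(eta)

# Linear predictor implied by the reported probability at the covariate means.
eta_at_means = logit(0.130)
```

The "estimate multiplied by mean" column can be read in the same way: each product is the contribution of one covariate to the linear predictor when all covariates sit at their means.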
The coefficients in Table 3.2 also correspond to our soccer intuition. For example, we
see that an increase in the ratio of offensive players in the box to defensive players in the
box leads to an increased probability of crossing (i.e. positive coefficient of z6 ). The most
impactful covariate (column 3 of Table 3.2) is z5 which is the distance between the sender
and the endline. As the player runs towards the end of the field, he runs out of options
and therefore his crossing probability increases.
Table 3.2: Estimates and standard errors for the parameters corresponding to model (3.1).
The third column provides the estimate multiplied by the mean value of its corresponding
covariate. The fourth column, marginal effect, is the product of the estimate and the standard
deviation of the corresponding z term.
Variable    Definition of Variable
z2^(s)      distance between the sender and nearest defender
z5^(s)      distance between the sender and the endline
z5^(r)      distance between the receiver and the endline
z10^(r)     distance between the receiver and the sideline
z11^(r)     speed of the receiver
z12         crossing angle between the sender and the receiver
z13         area of convex hull formed by potential receivers
Table 3.3: The key situational variables that are relevant to crossing success as modeled
in Section 3.4. All distances are measured in metres, speed is measured in metres/second,
angles are measured in degrees, and areas are measured in squared metres.
The random variable representing the receiver of the ith attempted cross takes values in
the set Vi ∪ {0} where Ri = 0 indicates that the cross was unsuccessful. Therefore, there are
Ki + 1 possible outcomes with respect to a given crossing attempt. Let Zi denote the spatio-
temporal features associated with the ith attempted cross as given in Table 3.3, where zij
is the observed vector associated with potential receiver j during crossing opportunity i.
We again use the logistic regression framework where the probability that player j is
the successful receiver of an attempted cross is given by
$$ p(R_i = j \mid Z_i) = \frac{\exp\{\beta^\top z_{ij}\}}{1 + \sum_{l=1}^{K_i} \exp\{\beta^\top z_{il}\}}, \qquad j = 1, \ldots, K_i. \qquad (3.2) $$
It is worth noting this model’s relation to other parametric models. Unlike multinomial
logit models, the parameters are not indexed by the possible outcomes - the potential
receivers. In multinomial models, there are a fixed number of values the random variable
can take. But, in our case, the number of potential receivers varies by crossing opportunity.
Unlike the conditional logit model, equation (3.2) contains features specific to the sender
that do not vary between potential receivers.
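A minimal sketch of how the probabilities in equation (3.2) could be evaluated for one attempted cross is given below. The function names are hypothetical. Two points are assumptions rather than statements from the text: the probability of an unsuccessful cross (R_i = 0) is taken to be the remaining mass, which follows from the K_i + 1 outcomes summing to one, and the log-likelihood is written in the usual categorical form.

```python
import math

def receiver_probabilities(beta, z_i):
    """Outcome probabilities for one attempted cross.

    beta : list of coefficients
    z_i  : list of feature vectors, one per potential receiver j = 1..K_i
    Returns [p(R_i = 0), p(R_i = 1), ..., p(R_i = K_i)], where index 0
    is the unsuccessful cross.
    """
    scores = [math.exp(sum(b * z for b, z in zip(beta, z_ij))) for z_ij in z_i]
    denom = 1.0 + sum(scores)
    return [1.0 / denom] + [s / denom for s in scores]

def log_likelihood(beta, crosses):
    """Sum of log-probabilities of the observed outcomes.

    crosses : list of (z_i, r_i) pairs with r_i in 0..K_i
    """
    return sum(math.log(receiver_probabilities(beta, z_i)[r_i])
               for z_i, r_i in crosses)
```

Because the same coefficient vector is applied to every potential receiver, the function works unchanged when the number of receivers differs from one cross to the next, which is the property that distinguishes this model from a standard multinomial logit.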
We fit the intended target model given by equation (3.2) using the situational variables in
Table 3.3. Table 3.4 presents the maximum likelihood estimates which provide various soccer
insights. We have also provided Wald statistics that provide corroboration. We observe that
as the distance between the sender and the nearest defender grows, the probability of cross
completion increases. Senders located farther from the goal complete crosses at a higher
rate. Receivers closer to the goal and farther from the sideline are more likely to successfully
receive a cross. Wider crossing angles are associated with a higher probability of a successful
reception. Faster moving receivers are more likely to receive a cross. And, more compact
spatial configurations of potential receivers are associated with higher completion rates.
Table 3.4: Estimates of the parameters from the intended target model and other related
statistics. The estimates describe associations between spatio-temporal features and the
successful completion of an attempted cross.
In addition to modeling the crossing process conditionally, our model provides insight
as to how teams execute their offense. This can help teams recognize favorable spatial
configurations for crossing. Figure 3.2 illustrates the success probabilities for crosses. In
Figure 3.2(a), the sender is more open and located farther from the nearest sideline. As the
distance between the nearest defender and the sender becomes larger, the probability of a
completed cross increases. Also, crosses attempted by senders further from the sideline are
completed with higher probability.
(a) The graph depicts the probability of cross completion to each of the po-
tential targets during an offensive attack. In this example player 7 receives
the cross.
(b) In this example, the cross is incomplete. One can see that the concentra-
tion of defenders around the most likely receiver, player 13, is much higher
than the concentration of defenders around player 7 in panel (a).
Figure 3.2: Panels (a) and (b) present output from the intended target model. These dia-
grams provide a way for teams to study the spatial configurations of players and the ball
during crossing opportunities.
variable Y = 1 (a resulting shot) or Y = 0 (not a resulting shot) corresponds to patient
outcomes where a patient may experience improved health or not. Finally, we note that
a complication in both scenarios is that the treatment T and the response Y may both
depend on auxiliary confounding variables Z. In the crossing problem, the variables Z are
provided by the spatio-temporal tracking data.
In general, randomized experiments provide the means for investigating the cause and
effect relationship of a treatment on a response. However, in the crossing problem, an
experimenter cannot assign (i.e. demand) that a player cross or not cross the ball during a
particular crossing opportunity. For this reason, we use causal inference techniques (Pearl
2009) in this retrospective framework to investigate the cause-effect relationship of crossing
on shots.
In this section we review the basic concepts used in the construction of interventional
probability distributions (e.g. propensity scores). Then we review how matching estimators
can be used to approximate an experiment with a randomized treatment. Finally, we com-
pute causal effects of crossing on attempted shots and quantify the uncertainty associated
with the estimates.
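As an illustration of how a matching estimator pairs treated and control units, the sketch below implements greedy one-to-one nearest-neighbour matching on the propensity score with a caliper. This is one common variant and is an assumption: the specific matching algorithm, the caliper value of 0.05 and the function name `match_on_propensity` are hypothetical and are not taken from the text.

```python
def match_on_propensity(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, control : lists of (unit_id, propensity_score)
    caliper          : maximum allowed difference in propensity score
                       (hypothetical default)
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    for t_id, t_score in sorted(treated, key=lambda unit: unit[1]):
        if not available:
            break
        # Closest remaining control unit on the propensity scale.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs
```

In the crossing problem, the treated units would be crossing opportunities where the ball was crossed (T = 1), the controls those where it was not (T = 0), and the propensity scores would come from the fitted logistic model of Section 3.3.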
We expand on the setup via the directed acyclic graph presented in Figure 3.3 where
arrows denote causal relationships. Again, the structure is analogous to a retrospective
medical study in which a treatment is assigned to a patient. Whether the patient receives
the treatment (T = 1) or not (T = 0) depends on the sets of covariates ZT and ZTY .
The health status of the patient may be classified according to health (Y = 1) or sickness
(Y = 0). The outcome variable Y depends on T as well as on the covariates ZTY and ZY .
The variables ZY cause Y , but not T; the variables ZT cause T, but not Y ; and the variables
ZTY cause both Y and T. Causal inference requires a method to address the confounding
related to the common causes ZTY .
Figure 3.3: The directed acyclic graph describes the crossing problem. The variables ZT
are causes of T, but not Y . The variables ZTY are common causes for T and Y . And, the
variables ZY are causes for Y , but not T. (Node labels: ZT, T, Confounding Variables, ZY, Y.)
3.5.2 Results
Following the implementation of the matching procedure, Figure 3.4 displays the balance
between the two groups with respect to the propensity scores. The similarity in the
histograms is important as it provides confidence that the two groups are similar according to
the characteristics that affect whether a player decides to cross the ball.
Figure 3.4: After matching, histograms of the two groups (treatment and control) are
depicted where the horizontal variable is the propensity score. (Panels: Cross, Do not cross;
horizontal axis: probability of crossing; vertical axis: count.)
The inferential component of our investigation begins with the simple two-sample test of
proportions between the two groups based on the response Y (resultant shot) as described
in Section 3.2.3. The quantity of interest is the average treatment effect ATE = Ȳ (1) − Ȳ (0) where
Ȳ (1) is the mean number of resultant shots when the ball is crossed and Ȳ (0) is the mean
number of resultant shots when the ball is not crossed. We obtain ATE = 0.050 with
standard error 0.020. The result is significant and suggests that crossing is beneficial in the
sense that a cross leads to a shot about 5 percentage points more often than not crossing.
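The two-sample comparison can be sketched as follows. The function names are hypothetical, the standard error uses the unpooled formula for a difference of two proportions (the text does not say which formula was used), and the matched group counts are not reported, so only the summary values ATE = 0.050 and standard error 0.020 are taken from the text.

```python
import math

def average_treatment_effect(shots_cross, n_cross, shots_no_cross, n_no_cross):
    """Difference in shot proportions and its unpooled standard error."""
    p1 = shots_cross / n_cross
    p0 = shots_no_cross / n_no_cross
    ate = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n_cross + p0 * (1 - p0) / n_no_cross)
    return ate, se

def two_sided_p_value(estimate, standard_error):
    """Two-sided p-value from a normal approximation."""
    z = estimate / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

# Reported summary values: z = 0.050 / 0.020 = 2.5.
p_reported = two_sided_p_value(0.050, 0.020)
```

Under the normal approximation, the reported values give a two-sided p-value of about 0.012, which is consistent with the statement that the result is significant.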
In Figure 3.5, we present a more nuanced view of the situation. For each group (treatment
and control), we smooth the variable Y with respect to the propensity score. We observe
that as the propensity score increases (i.e. conditions become more favourable to crossing)
the probability of a resultant shot increases for both groups. However, and most importantly,
we observe that the shot probability of the treatment group increases relative to the shot
probability of the control group. In practice, what this means is that players are making
correct tactical decisions. When players are more likely to cross (higher propensity scores),
they will have better offensive results (higher shot probabilities) than if they did not cross.
Therefore, the takeaway message is that crossing (when done under the right circum-
stances) is a good thing to do. And players do cross at the right times.
Figure 3.5: After matching, smoothed plots of the shot variable Y for both groups with
respect to the propensity score. (Vertical axis: shot probability; horizontal axis: propensity
score; curves: cross_attempt = 0 and cross_attempt = 1.)
3.6 Discussion
With access to player tracking data from the Chinese Super League, we have been able to
carry out various detailed investigations of crossing in soccer. In particular, we have devel-
oped a logistic regression model that characterizes spatio-temporal conditions for crossing
and an intended target model which explores crossing success.
However, the most important contribution of the paper concerns a reconciliation of the
results of Vecer (2014) and Sarkar (2018) who essentially state that crossing the ball in
soccer has a negative impact on scoring. The results of Vecer (2014) and Sarkar (2018) are
puzzling since the strategy of crossing the ball in soccer has a longstanding reputation as
an effective tactic. Using causal inference techniques, the message from this paper is that
crossing remains a valuable tactic. Players have an intuition as to when to cross, and on
average, when they cross, it is a beneficial time to cross. The beneficial crossing occasions
are determined by the covariate patterns in Section 3.3. For example, players ought to cross
when they are nearing the endline, when they have space and when there is a high ratio
of offensive to defensive players in the box. These nuanced insights that we have provided
are not necessarily a response to the work of Vecer (2014) and Sarkar (2018), but rather an
evolution of looking at crosses using player tracking data.
A common and legitimate criticism of the use of causal matching methods based on
propensity scores is that it is generally difficult to identify and measure confounding vari-
ables. However, we believe that sport offers one of the rare applications where the confound-
ing variable problem is manageable. Sport has well defined rules with finite time spans and
clear objectives. People understand sport well (say, compared to genetics), and consequently
there is hope for causal analyses in sport, especially with the increasing availability of de-
tailed player tracking data. However, even in this hopeful environment for the investigation
of causal inference, it would have been possible to introduce additional covariates. For ex-
ample, we did not account for the quality of teams, the movement of players, the position
of the keeper, the height of players in the box and additional player characteristics that are
available and could be scraped from websites such as SoFifa. We also did not consider the
EPV (expected point value) status when there is an opportunity to cross (Fernandez et al.
2019). It would be interesting to see how well our results can be generalized when using a
full season's worth of data and more leagues. It is our hope that this manuscript will inspire
future causal studies in sport.
There are a number of potential future directions for the research presented here. One
would involve modifying the binary response Y = shots to the increasingly popular ex-
pected goals statistic xG (Bundesliga 2019) or the metric from the VAEP (Valuing Actions
by Estimating Probabilities) framework (Decroos et al. 2019). The methodology could be
extended from crossing - not crossing to more specific analyses such as crossing - dribbling
or crossing - passing or crossing - dribbling - passing.
Another avenue involves extensions of the intended target model in Section 3.4. For
example, it would be possible to include the space covariate z3 from Table 3.1 to provide
more information about the nature of the convex hull (covariate z13 in Table 3.3).
We wish to emphasize that our data analyses were based on a single season’s worth of
data from Shandong Luneng FC. Although the inferences gained via our analyses are specific
to Shandong Luneng, it is plausible that some of the broad insights may hold generally to
high level soccer competitions. It would be interesting to see if our results hold for other
teams and leagues. Our suspicion is that there is commonality across soccer competitions,
and that the benefit of crossing in the specific situations mentioned in this paper would
translate to other high level teams and leagues.
Chapter 4
4.1 Introduction
In the sport of football (soccer), it has been estimated that on average, throughout a 90-
minute match, individual players have possession of the ball for less than two minutes
(Link and Hoernig 2017). It is therefore clear that traditional “on-the-ball” statistics such
as goals, tackles, assists, shots and pass completion percentages examine only a snapshot
of overall player performance. Encouraged by the “moneyball” phenomenon (Lewis 2013),
player evaluation via statistical analysis has become widespread across sports (Albert et
al. 2017). This paper considers a particular aspect of player evaluation in the context of
“off-the-ball” activity in soccer.
This paper introduces novel methods and a metric that evaluates a fundamental defen-
sive objective in soccer, namely defensive anticipation. When a defender anticipates quickly,
the defender denies the offensive team both time and space, and this contributes to win-
ning. Defensive awareness is important and is not always recognized. For example, by mov-
ing quickly, the defensive player may prevent a valuable pass which is never realized and
hence, never recorded. We apply our methods to an actual dataset, where the validity and
reliability of the metric are demonstrated.
There are currently no automatic methods (i.e. computer code) that produce metrics
for defensive anticipation. For an analyst (e.g. coach) to assess the defensive anticipation
of a player, there are two overriding difficulties. First, the analyst would need to monitor
the player for the entire 90 minutes of a match, and repeat this over many matches. This is
both time consuming and expensive. Second, the analyst would need to objectively evaluate
the player’s actions, sometimes in contexts where it is not clear what the player ought to
do. The purpose of this paper is to develop automatic methods which objectively evaluate
defensive anticipation. With these methods, information on defensive anticipation could be
made available for players from various leagues across the world. Therefore, we believe that
our methods may be beneficial with respect to player acquisition.
Our investigation is made possible by the availability of player tracking data. Player
tracking data in soccer consists of the Cartesian coordinates of the ball and the 22 players
on the pitch recorded at regular and frequent time intervals. With player tracking data, we
know the locations of all players at all times during a match, and this facilitates off-the-ball
evaluation. Gudmundsson and Horton (2017) provide a review paper on spatio-temporal
analyses used in invasion sports (including soccer) where player tracking data are available.
The visualization of team formations in soccer is a problem that has received particular
attention (Wu et al. 2019). The analysis of player tracking data has also been prominent in
the sport of basketball; see for example, Miller et al. (2014).
The study of off-the-ball activity is a new research area of great potential. Historically,
a limiting factor for such research has been the availability of tracking data. Tracking data
are necessary because we need to know what all players are doing at all times - this is the
basis for off-the-ball studies. There have been some off-the-ball analyses in basketball and
soccer that are based on the concept of “ghosting” (Lowe 2013, Le et al. 2016, Le et al. 2017
and Seidl et al. 2018). The rationale behind ghosting is that there are optimal and expected
paths for defensive players. In the ghosting work (which is proprietary), a main contribution
is the claim that if defensive players can replicate the optimal ghosting paths, then outcomes
would improve for the defensive team in terms of lower expected points/goals by the offensive
team. Also, coaches may be able to assess what-if scenarios. That is, if a given play is drawn
up, the expected ghost paths may indicate how the defensive team ought to respond. In the
ghosting approach, actual match sequences are studied from a given frame where observed
defensive positions are established. Then time frames are allowed to advance where the
offensive players continue on their observed path and the ghosts react to the offensive
movement. A limitation is that in real matches, offensive players move and react according
to the defense. Therefore, the offensive movements that were observed cannot be utilized
as responses to the ghosting paths. Spearman (2018) also used tracking data to investigate
off-the-ball activity through positioning. Goal scoring probabilities were estimated at player
locations using expected goal (xG) considerations and the probabilities of making successful
passes to the player locations. This interesting line of research is instructive in identifying
optimal positioning from an offensive perspective.
There are several other papers related to our work. Yurko and Pelechrinis (2021) used
a long short-term memory (LSTM) model to estimate the locations of “ghost” defenders
at the moment the ball was caught. This aids in the evaluation of defenders in limiting
the number of yards after the catch in the National Football League (NFL). Also in the
NFL, Cheong et al. (2021) used a deep learning model to predict trajectories of defensive
players to investigate various questions including “what-if” scenarios. Stöckl et al. (2021)
introduced a graph convolutional network to measure an aspect of defensive performance
in soccer. The approach attempts to assess how defender actions modify offensive behavior.
Llana et al. (2020) introduced the concept of off-ball advantage which builds on top of
the expected possession framework from Fernández et al. (2019). There is also a branch
of literature referred to as pitch control that is concerned with zones that players control
based on their current position and velocity (Brefeld, Lasek and Mair 2019).
A major challenge in off-the-ball research is the evaluation of actions. Our approach is
conceptually simple: Using roughly four million spatio-temporal instances, we use machine
learning techniques to predict the velocity (two-dimensional directional vector and speed)
of a defensive player in a given situation. A defensive anticipation metric is then developed
which compares the player’s actual velocity with the predicted velocity of a typical player in
this situation. The interpretation of the defensive anticipation metric is based on the tenet
that fast is better than slow (Blank 2012). Players that excel in this trait may be thought of
as energetic and quick-thinking, and provide a particular benefit to teams. Importantly, this
type of analysis is amenable to other invasion sports for which tracking data are available.
In Section 4.2, we describe the dataset. In Section 4.3, we develop the methods used
to evaluate defensive anticipation. The work is highly computational and we describe our
approach which is based on the use of a tree-based boosting algorithm. In Section 4.4,
the methods are then applied to an analysis of players from the Chinese Super League
where validity and reliability of the approach are demonstrated. We conclude with a short
discussion in Section 4.5.
4.2 Data
Our data consists of matches from the 2019 season of the Chinese Super League (CSL). The
league involved 16 teams where each team played every opponent twice, once at home and
once on the road. From these potential 240 matches, we have three missing matches.
From these 237 matches, event data and tracking data were collected independently
where event data consists of occurrences such as tackles and passes, and these were recorded
along with auxiliary information whenever an “event” takes place. The events were manually
recorded by technicians who view film. Both event data and tracking data have timestamps
so that the two files can be compared for internal consistency. In the CSL dataset, tracking
data were obtained from video and the use of optical recognition software. The tracking
data consists of roughly one million rows per match measured on 7 variables where the
data are recorded every 1/10th of a second. Each row corresponds to a particular player
at a given instant in time. Therefore, we have a big data problem where both event data
and player tracking data are available based on 237 regular season matches. Although the
inferences gained via our analyses are specific to the CSL, we suggest that the methods are
applicable to any soccer league which collects tracking data.
4.3 Methods
4.3.1 Rationale of the Approach
Consider a defender at a particular instant in time during a match. Our approach begins
with the prediction of a velocity vector (ŷ1, ŷ2) for the defender. It is important to emphasize
that the two-dimensional velocity vector contains both a directional component and a
magnitude (i.e. speed). The prediction is facilitated through the availability of tracking data
associated with the 2019 season of the CSL. With this massive dataset, there exist “similar”
circumstances in a spatio-temporal sense to the given situation. Therefore, the prediction
represents the velocity (i.e. speed and direction) of a typical player in the situation of
interest. Of course, the observed velocity (y1, y2) of the defender will not be exactly the same
as the predicted velocity (ŷ1, ŷ2). We posit that the defender will have performed above
average if they move quicker than predicted in the predicted direction. The quantification
of performance is formalized in Section 4.3.4. The desirability of moving quickly is a tenet
of many sports, including soccer, and is discussed in Chapter 1 of Blank (2012).
lenging for prediction since there are often multiple potential paths which offensive players
may choose.
A first step in the data analysis is the determination of ball possession which then defines
the defensive and offensive teams. In addition to player tracking data, we are also provided
with tagged event data that provides the timing of passes, dribbles, shots, etc. A possession
is retained if the same team maintains the control of the ball by either passing, dribbling
or attempting a shot, and the possession ends when the opponent gains control of the ball,
a penalty occurs, the ball goes out of bounds, etc.
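The possession rule described above can be sketched as a single pass over the tagged events. This is only an illustration: the `(time, team, event_type)` layout, the event labels and the function name `possessions` are hypothetical, and the list of possession-ending events is limited to the two named in the text.

```python
CONTROL_EVENTS = {"pass", "dribble", "shot"}      # actions that retain possession
ENDING_EVENTS = {"penalty", "out_of_bounds"}      # assumed labels; other cases not enumerated


def possessions(events):
    """Split time-ordered (time, team, event_type) tuples into possessions.

    Returns a list of (team, start_time, end_time) tuples.
    """
    result = []
    current_team, start, last = None, None, None
    for time, team, kind in events:
        if kind in ENDING_EVENTS:
            # Play is interrupted: close the open possession, if any.
            if current_team is not None:
                result.append((current_team, start, time))
            current_team = None
        elif kind in CONTROL_EVENTS:
            if team != current_team:
                # The opponent has gained control: close and start anew.
                if current_team is not None:
                    result.append((current_team, start, time))
                current_team, start = team, time
            last = time
    if current_team is not None:
        result.append((current_team, start, last))
    return result


events = [
    (0.0, "home", "pass"), (2.1, "home", "dribble"),
    (4.0, "road", "pass"), (6.5, "road", "shot"),
    (7.0, "road", "out_of_bounds"), (9.0, "home", "pass"),
]
segments = possessions(events)
```

At any instant inside a segment, the team named in the segment is treated as the offensive team and its opponent as the defensive team.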
To make the prediction problem more tractable, we introduce two data reductions. First,
we analyse match states every ϵ = 1 second. This is a tremendous data reduction (reduction
by a factor of 10) since tracking data are recorded every 1/10th of a second. However, over a
90-minute match this still leaves us with 5,400 potential observations per player per match.
With 11 defensive players on the pitch and the 237 regular season matches, this provides us
with over 14 million records. We view ϵ > 0 as a tuning parameter which we can increase or
decrease to adjust the total number of observations. The data reduction is advantageous in
the sense that player actions are essentially independent for larger values of ϵ. In soccer, a
player’s objectives at a given point in time are different and independent from his objectives ϵ
seconds later for sufficiently large ϵ. Our intuition is that player options change considerably
over states separated by ϵ ≥ 1 second.
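The record count quoted above follows from simple arithmetic, and the thinning amounts to keeping every tenth frame. A minimal sketch, assuming frames are stored in time order at 10 Hz and ignoring stoppage time; the names `thin_frames` and `potential_records` are hypothetical and not taken from the original analysis.

```python
FRAME_RATE_HZ = 10         # tracking data recorded every 1/10th of a second
MATCH_SECONDS = 90 * 60    # nominal match length, ignoring stoppage time
DEFENDERS = 11
MATCHES = 237


def thin_frames(frames, epsilon_seconds=1.0, frame_rate_hz=FRAME_RATE_HZ):
    """Keep one frame every epsilon seconds from a time-ordered sequence."""
    step = max(1, round(epsilon_seconds * frame_rate_hz))
    return frames[::step]


def potential_records(epsilon_seconds=1.0):
    """Upper bound on (defender, state) records for a given epsilon."""
    states_per_match = int(MATCH_SECONDS / epsilon_seconds)
    return states_per_match * DEFENDERS * MATCHES


frames = list(range(MATCH_SECONDS * FRAME_RATE_HZ))   # 54,000 frame indices
kept = thin_frames(frames)
print(len(kept))             # states per player per match
print(potential_records())   # potential records over the season
```

Doubling ϵ to 2 seconds halves both counts, which is the sense in which ϵ acts as a tuning parameter for the size of the dataset.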
Another data reduction decision involves the covariate vector x provided by the tracking
data. Based on our soccer knowledge, we posit that a player’s actions are mostly dependent
on the spatio-temporal characteristics of the ball and the players within their immediate
vicinity. Of course, there are long passes in soccer, but we exclude these considerations as
they are the exception rather than the rule. We therefore introduce the following covariates
for a given defensive player in a particular state:
• x9 - indicator for the player on offensive or defensive side of the field (1-dim)
• x10 - indicator for player belonging to the home or road team (1-dim)
Therefore, even though we have dramatically reduced the dimensionality of the tracking
data, we have retained a 61-dimensional covariate which we hope captures the main drivers
of how a player responds in a given situation. We note that the covariates contain a great
amount of information which is related to y in complex ways. For example, if a player is close
to goal, they may behave differently than if they are near midfield. Also, the movements
and space of nearby players naturally impact decisions.
The variable x2 and the associated tuning parameter ∆ ≥ 0 require additional discussion.
We cannot include x2 as a covariate with ∆ = 0 as this would render y = x2 at all times t,
and consequently, any fitting algorithm would yield the useless prediction ŷ = y. That is,
our predicted velocity would not be a typical velocity given the circumstances, but instead,
the observed velocity of the player of interest. However, the observed velocity y of the player
of interest at time t clearly depends on his movement prior to time t. For example, if a player
is moving forward at speed s, it is easier for him to quickly transition to speed s + δ moving
forward than speed s + δ moving backward. In summary, we ought to know about a player’s
movement before time t as this impacts movement at time t. In Section 4.3.3, we investigate
the selection of ∆.
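To make the role of ∆ concrete, the sketch below estimates a player's velocity at time t − ∆ from tracked positions. It assumes positions in metres sampled at 10 Hz; the function name `lagged_velocity` and the one-frame finite difference are illustrative assumptions rather than the computation used in the thesis.

```python
def lagged_velocity(xs, ys, frame, delta_seconds=0.5, frame_rate_hz=10):
    """Velocity (m/s) at time t - delta, from one-frame finite differences.

    xs, ys : sequences of positions in metres, one entry per tracking frame.
    frame  : index of the current frame (time t).
    """
    lag = round(delta_seconds * frame_rate_hz)   # delta expressed in frames
    j = frame - lag                              # frame index at time t - delta
    if j < 1:
        raise ValueError("not enough history before this frame")
    dt = 1.0 / frame_rate_hz
    return ((xs[j] - xs[j - 1]) / dt, (ys[j] - ys[j - 1]) / dt)


# A player running at a steady 4 m/s along the first coordinate.
xs = [0.4 * k for k in range(20)]
ys = [0.0] * 20
vx, vy = lagged_velocity(xs, ys, frame=10)
```

Setting `delta_seconds=0` returns the velocity at time t itself, which is the degenerate case y = x2 that the text rules out.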
The expected possession value EPV (feature x22) was made publicly available by Shaw
(2019). Given the spatial state of a match, EPV provides a measure of the attacking value
of each location on the field. We modify the EPV covariate of a player by setting it equal
to zero if the offensive player is offside. This is an important covariate in our analysis since
defenders should be cautious of balls being played to high EPV positions.
There is some redundancy in our covariates. For example, if we know the Cartesian
coordinates of two objects, then the distance between these two objects is a function of their
positions. However, to assist the machine learning algorithm of Section 4.3.3, we provide
some of these derived covariates. We have limited the covariates to the three nearest
teammates and three nearest opponents. In most cases, these are the players who most influence
the movement of the player of interest. The player of interest cannot intervene in locations
that are too distant.
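As an illustration of the derived covariates, the following sketch selects the three nearest teammates and opponents of a defender and returns their distances alongside their coordinates. The function name `nearest_k` and the example positions are hypothetical; positions are assumed to be (x, y) pairs in metres.

```python
import math


def nearest_k(player, others, k=3):
    """Return the k entries of `others` closest to `player`, nearest first.

    Each result is a (distance, (x, y)) tuple, so the derived distance
    covariate is available together with the raw coordinates.
    """
    ranked = sorted((math.dist(player, p), p) for p in others)
    return ranked[:k]


defender = (30.0, 20.0)
teammates = [(28.0, 20.0), (40.0, 25.0), (30.0, 26.0), (10.0, 5.0)]
opponents = [(33.0, 24.0), (31.0, 20.0), (50.0, 30.0), (29.0, 12.0)]

near_mates = nearest_k(defender, teammates)
near_opps = nearest_k(defender, opponents)
```

Supplying the distances explicitly is redundant given the coordinates, but it spares the learning algorithm from having to rediscover them.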
Figure 4.1, we choose the tuning parameter ∆ = 0.5 seconds where the correlation r ≈ 0.9.
The fitted model from LightGBM provides a mean absolute error of 0.319 m/sec in the
x-coordinate velocity and 0.398 m/sec in the y-coordinate velocity.
[Figure 4.1 plot: correlation on the vertical axis (ticks 0.2 to 1.0); horizontal axis ticks 0 to 9.]
Figure 4.1: Correlation of predicted speed at time t and actual speed at time t − ∆ where
time is measured in seconds. The blue dashed line corresponds to the selected value ∆ = 0.5
seconds.
off-the-ball performance at time t by
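\[
p =
\begin{cases}
\left( \sqrt{v_1^2 + v_2^2} - \sqrt{\hat{y}_1^2 + \hat{y}_2^2} \right) \Big/ \sqrt{\hat{y}_1^2 + \hat{y}_2^2} & v_1 \hat{y}_1 \ge 0 \\[6pt]
\left( -\sqrt{v_1^2 + v_2^2} - \sqrt{\hat{y}_1^2 + \hat{y}_2^2} \right) \Big/ \sqrt{\hat{y}_1^2 + \hat{y}_2^2} & v_1 \hat{y}_1 < 0
\end{cases}
\tag{4.1}
\]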
[Figure 4.2 diagram: axes y1 (horizontal) and y2 (vertical), with labelled points A and B and the vectors (yobs1, yobs2), k(ŷ1, ŷ2) and (v1, v2).]
Figure 4.2: Geometric diagram which illustrates the components of the statistic p in equation
(4.1). Imagine a player who is located at the origin (0, 0). The observed velocity of the
player is shown by the blue vector pointing towards (2, 4). The predicted velocity of an
average player is shown by the yellow vector pointing towards (8, 4). The perpendicular line
indicates the projection of the observed velocity vector on the predicted velocity vector.
Using equation (4.1), the defensive anticipation value, p, is equal to −0.6, which can be
interpreted as a 60% reduction compared to the average player.
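The calculation in Figure 4.2 can be reproduced with a few lines of code. The sketch below projects the observed velocity onto the predicted velocity to obtain (v1, v2) and then evaluates the statistic p of equation (4.1); the function name `anticipation` is hypothetical.

```python
import math


def anticipation(observed, predicted):
    """Defensive anticipation p for one instance, following equation (4.1).

    observed, predicted : (y1, y2) velocity pairs in m/s.
    The observed velocity is projected onto the predicted velocity, giving
    (v1, v2); p compares the signed length of that projection with the
    predicted speed.
    """
    y1, y2 = observed
    p1, p2 = predicted
    pred_sq = p1 * p1 + p2 * p2
    if pred_sq == 0:
        raise ValueError("predicted velocity must be non-zero")
    k = (y1 * p1 + y2 * p2) / pred_sq       # projection coefficient
    v1, v2 = k * p1, k * p2                 # projected velocity (v1, v2)
    pred_speed = math.sqrt(pred_sq)
    proj_speed = math.hypot(v1, v2)
    sign = 1.0 if v1 * p1 >= 0 else -1.0    # case split used in (4.1)
    return (sign * proj_speed - pred_speed) / pred_speed


p = anticipation(observed=(2.0, 4.0), predicted=(8.0, 4.0))
print(round(p, 3))   # -0.6 for the vectors of Figure 4.2
```

The sketch follows the case split of (4.1) literally, so the sign is decided by the first coordinate only. When the predicted velocity has a zero first coordinate, the sign of the projection coefficient `k` may be a more robust test; that variation is an assumption and not part of the stated formula.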
The player’s season long performance is then given by the defensive anticipation metric
\[
P = \left( \frac{1}{N} \sum_{i=1}^{N} p_i \right) 100\% \tag{4.2}
\]
where the summation is taken over all instances where the predicted velocity exceeds the
threshold speed and the index i = 1, . . . , N corresponds to the cases involving the player
during the season. We can think of (4.2) as a metric which measures defensive anticipation.
The multiplicative factor 100% in (4.2) permits a nice interpretation; a P-score of +x
describes a player whose defensive anticipation is x% above the average player whereas
a P-score of −x describes a player whose defensive anticipation is x% below the average
player.
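The aggregation in (4.2) is a filtered average. A minimal sketch, assuming each instance is stored as a pair of its p value and the predicted speed; the function name `season_score`, the example values and the threshold of 1.0 m/sec are hypothetical and chosen only for illustration.

```python
def season_score(instances, threshold_speed):
    """Season-long P in percent, following equation (4.2).

    instances : iterable of (p_i, predicted_speed_i) pairs for one player.
    Only instances whose predicted speed exceeds threshold_speed are kept.
    """
    kept = [p for p, speed in instances if speed > threshold_speed]
    if not kept:
        raise ValueError("no instances exceed the threshold speed")
    return 100.0 * sum(kept) / len(kept)


# Hypothetical (p_i, predicted speed in m/s) pairs for one player.
instances = [(0.10, 3.0), (-0.04, 2.5), (0.50, 0.2), (0.03, 4.1)]
P = season_score(instances, threshold_speed=1.0)
```

In this toy example the third instance is discarded because its predicted speed is below the threshold, and the remaining three values average to a P-score of about +3.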
Figure 4.3: Plot of predicted velocities (purple arrows) and observed velocities (black arrows)
at a given instant in time. The blue team is in possession, the yellow team is defending and
the red dot corresponds to the ball.
4.4.1 Reliability
With respect to a metric, reliability refers to the consistency of the measure. In other words,
reliability addresses reproducibility. For example, it would be undesirable if our defensive
anticipation metric (4.2) identified a player as having great defensive anticipation for half
of the matches and terrible defensive anticipation in the other matches. Since we expect
some consistency in professional athletes, this would suggest that there is little value in the
metric.
To investigate this, we divided the 2019 CSL season into even and odd weeks. The
premise is that the metric (4.2) measures an aspect of playing style, and that style should
not differ greatly between the two sets of weeks. In Table 4.1, we provide results for the 10
players on Shandong Luneng for whom the number of instances N > 10, 000 in (4.2) for
both sets of weeks. Shandong Luneng is an interesting CSL team as two of the international
players (Fellaini and Pelle) are well known to those who follow the English Premier League.
We observe that there is consistency in the player metrics across the two sets of weeks. In
fact, the ranks of the 10 players are identical across the two sets of weeks. This suggests that the
defensive anticipation metric (4.2) is reliable and is capturing an aspect of playing style.
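The agreement in ranks can be checked directly from the P values reported in Table 4.1. The sketch below ranks the players within each set of weeks and computes the Spearman rank correlation using the no-ties formula; the helper names `ranks` and `spearman` are hypothetical.

```python
# P values for even and odd weeks, as reported in Table 4.1.
p_even = {"Fellaini": 2.8, "Zhang Chi": 2.4, "Liu Yang": 1.6, "Wang Tong": 0.4,
          "Hao Junmin": -0.3, "Zheng Zheng": -1.6, "Dai Lin": -2.1,
          "Pelle": -3.7, "Gil": -4.1, "Guedes": -5.5}
p_odd = {"Fellaini": 2.4, "Zhang Chi": 2.0, "Liu Yang": 1.8, "Wang Tong": 0.2,
         "Hao Junmin": -1.4, "Zheng Zheng": -2.6, "Dai Lin": -3.1,
         "Pelle": -4.1, "Gil": -5.1, "Guedes": -5.7}


def ranks(scores):
    """Rank players from highest (1) to lowest score; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman(a, b):
    """Spearman rank correlation for two score dicts without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[name] - rb[name]) ** 2 for name in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


rho = spearman(p_even, p_odd)
```

A rank correlation of 1 is the strongest agreement possible, although with only 10 players from a single team it should be read as supporting evidence rather than a formal test.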
4.4.2 Validity
With respect to a metric, validity refers to the accuracy of the measure. In our investigation,
we are interested in whether the metric P in (4.2) really measures defensive anticipation.
Player              N_even   N_odd    P_even (rank)   P_odd (rank)
Marouane Fellaini   17,146   17,340    2.8 (1)         2.4 (1)
Zhang Chi           16,647   17,235    2.4 (2)         2.0 (2)
Liu Yang            19,556   19,845    1.6 (3)         1.8 (3)
Wang Tong           13,955   20,034    0.4 (4)         0.2 (4)
Hao Junmin          16,050   16,696   -0.3 (5)        -1.4 (5)
Zheng Zheng         14,582   10,849   -1.6 (6)        -2.6 (6)
Dai Lin             14,030   18,423   -2.1 (7)        -3.1 (7)
Graziano Pelle      19,337   18,302   -3.7 (8)        -4.1 (8)
Gil                 10,159   13,306   -4.1 (9)        -5.1 (9)
Roger Guedes        14,067   16,737   -5.5 (10)       -5.7 (10)
Table 4.1: The defensive anticipation metric P calculated during even and odd weeks for
players on Shandong Luneng during the 2019 season.
To investigate validity, we first consider the defensive anticipation metric (4.2) for all
438 outfield players in the CSL dataset. The players are categorized according to the five
broad playing positions as follows: wide midfielder (n = 79), wide defender (n = 77),
forward (n = 86), central midfielder (n = 110) and central defender (n = 86). Density plots
of (4.2) corresponding to each of the playing positions are shown in Figure 4.4. We observe
that there is little difference in (4.2) across the playing positions. We note that central
midfielders have slightly larger values of (4.2) than other players on average (as might be
expected). This may be related to the defensive aggressiveness required at that position.
We also observe that there is more variability in (4.2) amongst the forwards than the other
playing positions.
[Figure 4.4 plot: density of P (horizontal axis, −10 to 20) for each position: wide midfielder, wide defender, forward, central midfielder and central defender.]
Figure 4.4: Density plots of (4.2) based on playing position. For each player, the defensive
anticipation metric (4.2) was calculated for all matches in the 2019 CSL season. We observe
that central midfielders have slightly larger defensive anticipation values than other players
on average, and there is more variability amongst the forwards than the other playing
positions.
Recall that a difficulty in assessing the validity of the proposed metric (4.2) is that there
is no gold standard for the truth. We do not know with certainty which players play with
more and less defensive anticipation (combination of energy and quick-thinking). Therefore,
we took the same players from Shandong Luneng as in Table 4.1, and ranked these players
according to their P-scores (4.2) from the entire 2019 season. The results are provided
in Table 4.2. In Table 4.2, we made comparisons with various measures of aggression. We
provide season long data on fouls, successful tackles and interceptions. We excluded card
accumulation as cards are relatively rare events. We observe that the aggressiveness inherent
in fouls, successful tackles and interceptions correlates with our defensive anticipation
metric. For example, the correlation coefficients between P and these three statistics are
0.60, 0.65 and 0.49, respectively.
In Table 4.2, we explored the relationship between P with player interceptions and
tackles in the context of Shandong Luneng. We expanded this investigation by considering
all players in the CSL who had played at least 500 minutes during the 2019 season. Figure
4.5 provides scatterplots relating P to interceptions and tackles. We observe that these
measures of aggression (i.e. interceptions and tackles) correlate with P across the league.
Player              P (rank)      Fouls (rank)   Tackles (rank)   Interceptions (rank)
Marouane Fellaini    2.64 (1)     46 (1)         21 (5.5)         23 (4)
Zhang Chi            2.20 (2)     32 (2.5)       21 (5.5)         29 (2)
Liu Yang             1.71 (3)     26 (4.5)       33 (1)            6 (8)
Wang Tong            0.26 (4)     15 (9)         19 (7)           27 (3)
Hao Junmin          -0.85 (5)     25 (6)         23 (4)           22 (5)
Zheng Zheng         -1.99 (6)     17 (8)         29 (2)           12 (7)
Dai Lin             -2.67 (7)     32 (2.5)       24 (3)           33 (1)
Graziano Pelle      -3.91 (8)     26 (4.5)        6 (10)           2 (9.5)
Gil                 -4.65 (9)      6 (10)        13 (8)           13 (6)
Roger Guedes        -5.63 (10)    21 (7)          7 (9)            2 (9.5)
Table 4.2: The defensive anticipation metric P given by (4.2) for 10 players on Shandong
Luneng who received the most playing time during the 2019 CSL season. We also provide
comparison metrics involving aggression during the 2019 season, namely the total number
of fouls committed, tackles made and the number of interceptions.
[Figure 4.5 plot: two panels with P on the vertical axis (−10 to 5); left panel against interceptions per game (R = 0.32), right panel against tackles won per game (R = 0.29).]
Figure 4.5: Scatterplots of the defensive anticipation metric (4.2) plotted against player
interceptions and tackles made during the 2019 CSL season.
We investigated the validity of our metric further by calculating the average P-score for
all CSL players where we divided matches into 10-minute intervals. The plot is provided
in Figure 4.6. We observe that P decreases as the match progresses. Since players tire as
the game proceeds (both physically and mentally), it makes sense that our metric (4.2)
decreases. There appears to be a big drop after the 70th minute of the match. Perhaps
we could introduce alternative metrics to explore the dropoff in defensive anticipation. For
example, it may be valuable to know which players have greater drops in performance and
when these occur.
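The interval averages can be obtained by binning instances on the match clock. A minimal sketch, assuming each instance is stored as a pair of the match minute and its p value pooled over players; the function names and the example values are hypothetical.

```python
import math
from collections import defaultdict


def interval_label(minute):
    """Map a match minute to the intervals [0,10], (10,20], ..., (80,90], 90+."""
    if minute > 90:
        return "90+"
    if minute <= 10:
        return "[0,10]"
    upper = 10 * math.ceil(minute / 10)
    return f"({upper - 10},{upper}]"


def average_by_interval(instances):
    """Average of 100 * p_i within each interval.

    instances : iterable of (minute, p_i) pairs pooled over players.
    """
    buckets = defaultdict(list)
    for minute, p in instances:
        buckets[interval_label(minute)].append(p)
    return {label: 100.0 * sum(ps) / len(ps) for label, ps in buckets.items()}


# Hypothetical instances: (match minute, p_i).
instances = [(5.0, 0.02), (8.5, 0.00), (75.2, -0.03), (79.9, -0.01), (93.0, -0.02)]
averages = average_by_interval(instances)
```

Grouping the same instances by player as well as by interval would give the player-specific fatigue profiles suggested above.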
It is interesting that amongst CSL players with regular minutes, the two players with the
highest P-scores are Chang Feiya of Wuhan Zall (P = 5.71) and Yang Shiyuan of Shanghai
[Figure 4.6 plot: average P on the vertical axis (ticks −2 to 1) for the match intervals [0,10], (10,20], (20,30], (30,40], (40,50], (50,60], (60,70], (70,80], (80,90] and 90+ minutes.]
Figure 4.6: Plot of the defensive anticipation metric (4.2) averaged over all CSL players
during 10-minute intervals.
SIPG (P = 5.33). Feiya is primarily a midfielder and does not have remarkable statistics; he
scored only one goal in the 2019 season. Interestingly, the website
https://www.allfamousbirthday.com/chang-feiya/ describes Feiya as one of the most popular
Chinese football players. Shiyuan is a midfielder who also does not have remarkable statistics;
he did not score during the 2019 season. Interestingly, the website
https://www.whoscored.com/Players/143864/Show/Yang-Shiyuan describes Shiyuan as a player
who likes to tackle and commits fouls often.
4.5 Discussion
We have introduced an important and seminal area of research where automatic and
objective methods have been developed to assess a particular defensive characteristic of
off-the-ball behaviour. We have referred to the proposed metric (4.2) as defensive anticipation.
The methods can be adapted to any invasion sport where tracking data are available.
The evaluation of off-the-ball performance is viewed in a narrow context where fast is
considered better than slow. Even if speed is not the ultimate metric in off-the-ball
evaluation, the metric (4.2) developed here may uncover insights into aspects of play. Perhaps
players with high evaluations may be thought of as “high motor” players whose skills are
useful to teams. An important aspect of the research is that our metric measures aspects of
industry, laziness, anticipation and quick-thinking; these are characteristics that have not
been previously quantified.
Some other notable aspects of our work include the following: the proposed metric is
seen as reliable in the sense that it truly captures intrinsic player tendencies (Table 4.1),
the metric adheres to expected results such as the positive correlation between the metric
and other statistics related to aggression (Figure 4.5), and decreasing defensive anticipation
as players tire (Figure 4.6).
Chapter 5
Pitch Control
5.1 Introduction
Pitch control (field ownership) has become an increasingly discussed topic in sports
analytics. At a given point in time during a match, each location on the field may be “dominated”
by a particular player. Maps that depict this domination are referred to as pitch control
diagrams and the construction of these maps is the subject of this paper.
The determination of pitch control is an important component in the investigation of
tactics. For example, related to pitch control is the quantification of scoring opportunities
from off-the-ball positions in soccer (Spearman 2018). In turn, this provides an analysis
of optimal player movement, player tendencies, the evaluation of player movement and
attacking capability (Link, Lang and Seidenschwarz 2016). In the National Football League
(NFL), pitch control covariates have proven useful in the estimation of catch probabilities
(Reyer and Swartz 2021).
The availability of tracking data is fundamental to the analysis of pitch control. Tracking
data in soccer consists of the (x, y) coordinates of the ball and the 22 players on the pitch
recorded at regular and frequent time intervals. With tracking data, we know the locations
of all of the players at all times during a match, and this facilitates the determination of
pitch control. A feature of our research is that the determination of pitch control is easily
automated whenever teams have access to tracking data. Gudmundsson and Horton (2017)
provide a review paper on spatio-temporal analyses used in invasion sports (including soccer)
where tracking data are available. Goes et al. (2021) discuss a range of tactical problems in
soccer based on the use of tracking data. For example, the visualization of team formations
is a problem that has received particular attention (Wu et al. 2019). The analysis of tracking
data has also been prominent in the sport of basketball; see for example, Miller et al. (2014).
For a review of statistical contributions that have been made across major sports, see the
text by Albert et al. (2017).
In Section 5.2, we provide a literature review of prominent approaches that have been
utilized in the construction of pitch control diagrams. Commentary is provided that discusses
the relative strengths and weaknesses of the approaches. In Section 5.3, we introduce a new
pitch control metric which is illustrated in the context of soccer but is adaptable to other
invasion sports. A motivation of the approach is that it considers relevant soccer dynamics.
That is, we consider the location, speed and direction of both the ball and the players in
anticipation of their movement. The approach is conceptually straightforward compared to
some of the approaches that have been introduced in the literature. This is important
as it allows users to vary parameters to suit their application. In Section 5.4, we implement
the proposed pitch control metric in the context of an example. In Section 5.5, we evaluate
the accuracy of the proposed pitch control metric, an analysis that is novel in the pitch
control literature. We conclude with a short discussion in Section 5.6.
Whereas some of the code that has been developed in the pitch control literature is
proprietary, we have provided a function that allows users to develop applications based on
our version of pitch control. The code is available in the appendix.
Pitch control as determined by Voronoi diagrams has been utilized by various investiga-
tors in soccer including Kim (2004). However, there are some important limitations of the
Voronoi diagrams as they do not consider relevant features of the corresponding sporting
application. For example, a player could be moving in a particular direction. This is
relevant because it is easier for the player to control regions in the direction of travel and it is
more difficult to control regions that are opposite to the direction of travel. Clearly, player
velocities impact pitch control and this is not addressed with Voronoi diagrams.
Figure 5.1: Voronoi diagram based on n = 5 points generated on the unit square.
In Figure 5.2, we provide a Voronoi diagram that is based on the instantaneous player
positions in an actual soccer match. The coloring no longer refers to individual players, but
rather the two teams. Consequently, we observe that the field is partitioned into two sets.
We also observe that there is a lack of smoothness in the regions based on the Voronoi
construction.
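Team-level Voronoi ownership reduces to a nearest-player rule evaluated over a grid of locations. The following sketch is illustrative only: the function names, the 105 m by 68 m pitch and the 1 m grid are assumptions, and the two-player example is hypothetical.

```python
import math


def voronoi_owner(location, players):
    """Team label of the player nearest to `location`.

    players : list of (team, (x, y)) pairs with positions in metres.
    Velocities are ignored, which is the limitation noted above.
    """
    team, _ = min(players, key=lambda tp: math.dist(location, tp[1]))
    return team


def ownership_share(players, length=105.0, width=68.0, step=1.0):
    """Fraction of grid cells owned by each team on a length x width pitch."""
    counts = {}
    nx, ny = int(length / step), int(width / step)
    for i in range(nx):
        for j in range(ny):
            cell = ((i + 0.5) * step, (j + 0.5) * step)   # cell centre
            owner = voronoi_owner(cell, players)
            counts[owner] = counts.get(owner, 0) + 1
    total = nx * ny
    return {team: n / total for team, n in counts.items()}


players = [("home", (30.0, 34.0)), ("road", (76.0, 34.0))]
share = ownership_share(players)
```

Because each cell is assigned wholly to one team, the boundary between the regions is a straight edge, which is the source of the lack of smoothness observed in the figure.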
[Figure 5.2 plot: pitch snapshot with players labelled by number; legend: Team (Road, Home).]
Figure 5.2: Voronoi diagram applied to a given snapshot of a soccer game based on the
location of the 22 players on the pitch. The shaded orange and purple areas correspond to the
dominant regions for the home and away teams, respectively.
Prior to the advent of commercial tracking data, Taki, Hasegawa and Fukumura (1996)
developed a motion analysis system whereby cameras and software were utilized to extract
features from a soccer match. In particular, they introduced the “dominant region” concept
which addresses pitch control. Unlike Voronoi diagrams, the dominant regions of Taki,
Hasegawa and Fukumura (1996) are not polygons, but areas delineated by smooth curves.
A dominant region for a player is defined as the set of points where the player can arrive
earlier than all other players. A player’s time of arrival at a given point takes into account the
player’s current location, current speed and potential acceleration. The acceleration model is
not fully disclosed in Taki, Hasegawa and Fukumura (1996) but it is based on patterns from
an average player and permits reduced accelerations in the direction of movement. It also
appears that the calculation of arrival time does not utilize a cap on player velocities which
seems to be a practical limitation of the approach. Fujimura and Sugihara (2005) address
the limitation of maximal velocity in their modified approach to pitch control. A notable
difference between the dominant region approach and the metric proposed in Section 5.3
is that Taki, Hasegawa and Fukumura (1996) do not take the dynamics of the ball into
account. For example, there could be a location which player A can reach in time tA and
player B can reach in time tB where tA < tB . However, it would be incorrect to assert that
player A has dominance at this location if it takes time t > tB for the ball to reach the
location. In this case, neither player has unique control over the location. Gudmundsson
and Wolle (2014) extend these ideas where individual player characteristics are estimated
from data. Taki and Hasegawa (2000) provide more detail on the dominant region approach
and use the framework to evaluate teamwork in soccer.
Brefeld, Lasek and Mair (2019) provide a probabilistic approach to the construction
of pitch control diagrams based on machine learning methods. They refer to this area of
research as movement models. The movement model for a given player is described by the
conditional density function
\[ P_{t_\Delta}(p \mid p_t, v_t) \tag{5.1} \]
where p is the location attained during time horizon t∆ . The density (5.1) is conditional on
the player’s position pt and velocity vt at the current time t. As locations refer to positions on
the field, p, pt and vt are two-dimensional variables. Brefeld, Lasek and Mair (2019) suggest
various algorithms for the estimation of the conditional density (5.1) and its corresponding
discrete approximations. The algorithms are computationally intensive and use historical
data to inform player movement. In contrast to the approach developed in Section 5.3, the
conditional density (5.1) does not consider various aspects of the state of the game (e.g.
ball position, ball movement, and current player acceleration). It is also important to note
that not all historical movement data is pertinent to a given situation. For example, players
move slowly in situations where they are not active, and these movements are not relevant
to situations where they need to be active.
Having estimated the conditional density (5.1) for every player, Brefeld, Lasek and Mair
(2019) then define the “zone of control” for a given player as the locations on the field where
the player’s density (5.1) is greater than the density for any other player. The intuitive idea
is that the player attains locations in the zone with higher probability than any other player.
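The zone-of-control rule is an argmax over player densities. The sketch below illustrates the rule with a stand-in movement model: an isotropic Gaussian centred where the player would be after the time horizon at constant velocity. This stand-in, the value of `sigma` and all names are assumptions made for illustration; they are not the estimators of Brefeld, Lasek and Mair (2019), which are learned from historical data.

```python
import math


def gaussian_density(p, position, velocity, horizon=1.0, sigma=3.0):
    """Stand-in for density (5.1): isotropic Gaussian centred at the location
    reached after `horizon` seconds at constant velocity (an assumption)."""
    cx = position[0] + velocity[0] * horizon
    cy = position[1] + velocity[1] * horizon
    d2 = (p[0] - cx) ** 2 + (p[1] - cy) ** 2
    return math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def zone_owner(p, players, horizon=1.0):
    """Name of the player whose density at location p is largest."""
    return max(
        players,
        key=lambda name: gaussian_density(p, *players[name], horizon=horizon),
    )


# Hypothetical players: name -> (position, velocity), in metres and m/s.
players = {
    "A": ((50.0, 30.0), (5.0, 0.0)),   # running towards larger x
    "B": ((58.0, 30.0), (0.0, 0.0)),   # standing still
}
owner = zone_owner((54.5, 30.0), players)
```

In this example the location (54.5, 30) is closer to the current position of player B, yet it falls in the zone of player A because of A's velocity. With a horizon of zero the rule reverts to nearest-player ownership.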
Fernández and Bornn (2018) also use statistical methods to determine pitch control
where their movement models are based on parametric distributions. For the ith player,
they define an influence degree Ii (p, t) for location p at a future time t. The influence degree
is a density ratio corresponding to a bivariate normal distribution that takes into account
various soccer dynamics such as current player velocity. Field ownership by a team is then
assessed through a kernel-based non-parametric point process which considers the
cumulative influence degrees of all players on the field. An important feature of their approach
is that the method elicits degrees of field ownership rather than a binary outcome. It is
also important to note that the degree of field ownership is not a probability, and that
the Gaussian distributions do not permit skewness. It would be interesting to calibrate the
degrees of field ownership with empirical probabilities of field ownership. Martens, Dick and
Brefeld (2021) provide additional critique on the approaches developed by Brefeld, Lasek
and Mair (2019) and Fernández and Bornn (2018).
William Spearman has developed pitch control models for the Liverpool Football Club
where ideas are sketched out in a conference presentation (Spearman 2016) and also in the
YouTube video https://www.youtube.com/watch?v=X9PrwPyolyU. The approach is also
probabilistic and is based on the estimated time ti that it takes the ith player to reach a given
location. Players are labelled li = 1(0) according to whether they play on the home(road)
team. The probability that the home team ends up with possession at the given location is
given by
\[
\mathrm{Prob(Home)} = \left( \frac{\sum_i l_i \, t_i^{-\beta}}{\sum_i t_i^{-\beta}} + 1 \right) \Big/ 2 \tag{5.2}
\]
where pass impact is the probability of scoring a goal from the given location. Pass impact
is determined by historical data and soccer considerations such as the location on the field,
the relative position of defenders, etc. Of course, the simple formula given by (5.3) does not
take into account the difficulty of executing the pass successfully. A more sophisticated way
to build an EPV framework is discussed in Cervone et al. (2016) and Fernández, Bornn and
Cervone (2021).
All of the approaches reviewed in this section have a common element. The commonality
is that there is an estimated minimal time for each player to reach a location. This is even
true in the case of Voronoi diagrams where the minimal time is proportional to distance
under the assumption that all players travel at a constant speed. Ownership of the location
is then determined by assessing these minimal times by providing more weight to player
ownership for players that can reach the location faster. In the following section, we take
a philosophically different approach where it is asserted that if two players can reach a
location by the time of ball arrival, neither player has a claim or advantage involving field
ownership. This results in a metric that is relatively simple and provides regions of team
dominance that do not form a partition of the field. Rather, there are regions of dominance
and contested areas. This type of mapping may prove advantageous in some applications.
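
The remark that Voronoi diagrams are a special case of minimal arrival times can be illustrated with a minimal R sketch. It assumes straight-line running at a common constant speed; the function name `voronoi_owner`, the player positions and the default speed are hypothetical choices made for illustration only.

```r
# Assign each location to the team of the player with the smallest arrival
# time, assuming all players run in straight lines at the same constant speed.
voronoi_owner <- function(loc_x, loc_y, player_x, player_y, team, speed = 5) {
  # rows = locations, columns = players
  dx <- outer(loc_x, player_x, "-")
  dy <- outer(loc_y, player_y, "-")
  time_to_reach <- sqrt(dx^2 + dy^2) / speed
  # the column with the smallest time is the owner of the location
  team[max.col(-time_to_reach, ties.method = "first")]
}

# hypothetical positions (metres) of two home and two road players
player_x <- c(10, 30, 20, 40)
player_y <- c(10, 40, 30, 20)
team     <- c("home", "home", "road", "road")

owners <- voronoi_owner(loc_x = c(12, 38), loc_y = c(12, 22),
                        player_x = player_x, player_y = player_y, team = team)
print(owners)
```

Because every player shares the same speed, the speed cancels in the comparison and ownership depends on distance alone, which is exactly the Voronoi partition.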
tb < th + ϵ < tr . This states that the player on the home team must have a little bit of time
to gain possession (relative to the arrival of the road player) in order for pitch control to be
established. For example, we might set ϵ = 0.5 seconds for professional soccer players. The
consequence of such modifications is to partition the field with larger regions corresponding
to “neither team”. The introduction of free spaces that are controlled by neither team has
also been investigated by Caetano et al. (2021).
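
As a rough illustration of the inequality tb < th + ϵ < tr, the following R sketch classifies a location given the three arrival times. Table 5.1 is not reproduced here, so the rule for the road team is an assumption obtained by symmetry, and the function name `classify_control` is hypothetical.

```r
# Classify pitch control at a location from the arrival times of the ball
# (t_b), the fastest home player (t_h) and the fastest road player (t_r).
# Assumption: the road-team rule mirrors the home-team inequality.
classify_control <- function(t_b, t_h, t_r, eps = 0.5) {
  home <- t_b < t_h + eps & t_h + eps < t_r
  road <- t_b < t_r + eps & t_r + eps < t_h
  ifelse(home, "home", ifelse(road, "road", "neither"))
}

control <- classify_control(t_b = c(1.0, 1.0, 3.0, 1.0),
                            t_h = c(1.2, 1.2, 1.0, 3.0),
                            t_r = c(2.5, 1.5, 1.2, 1.0))
print(control)
```

In the second case the home player arrives first but by less than ϵ, and in the third case both players arrive well before the ball, so neither team is assigned control.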
Table 5.1: The determination of pitch control at a given location given time inequalities
involving tb , th and tr .
In what follows, we use the Cartesian coordinate system where locations, velocities and
accelerations are described by ordered pairs. Distances and times are measured in metres
and seconds, respectively.
Shaw (2020) provides speeds of 15 metres/sec for balls that are passed between teammates.
Maximum speed of kicked soccer balls has been estimated as high as 30 metres/sec
(https://soccerballworld.com/soccer-ball-physics/). For this application, we have set the
speed of the ball at s = 18 metres/sec based on an investigation of all completed passes
during the 2019 season of the Chinese Super League (CSL).
This leads to the quantity of interest
\[
t_b = \frac{\sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2}}{s} \qquad (5.4)
\]
which is based on historical data using the approximation that a kicked ball travels at a
constant speed.
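
Equation (5.4) translates directly into a short R function. This is a minimal sketch; the function name `ball_time` and the example coordinates are illustrative assumptions, while the default s = 18 metres/sec is the value stated above.

```r
# Time for the ball to travel from (x0, y0) to (x1, y1) at constant speed s
ball_time <- function(x0, y0, x1, y1, s = 18) {
  sqrt((x1 - x0)^2 + (y1 - y0)^2) / s
}

t_b_straight <- ball_time(0, 0, 18, 0)   # an 18 metre pass along the x-axis
t_b_diagonal <- ball_time(0, 0, 9, 12)   # a 15 metre diagonal pass
print(c(t_b_straight, t_b_diagonal))
```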
5.3.3 Timing of Players
The time that it takes a player to reach the location of interest is a more complex calculation
than the time tb in (5.4) for the ball to reach the location of interest. Although there are
n = 22 players on the soccer pitch, we use simplified notation where we suppress player
subscripts. We consider a single player where we use t to denote the time that it takes the
player to travel from their current location to the location of interest with intent. We further
define
Before proceeding with the calculation of the time t that it takes the player to reach the
location of interest, we discuss some of the assumptions related to our motion model. We
label these assumptions and the associated discussions (A) - (C).
Assumption A regarding (vx0 , vy0 ): We obtain the current velocity (vx0 , vy0 ) directly
from the tracking data. The current player velocity may be approximated from the tracking
data by considering the change in location by the player during a small time increment
surrounding the current time; e.g. vx0 = ∆x0 /∆t where ∆x0 is the distance travelled in the
x-coordinate direction in a window of time length ∆t surrounding the current time. Some
exploratory procedures for accurately estimating speed in tracking data are discussed by
Wu and Swartz (2022).
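
The finite-difference approximation in Assumption A can be sketched in R as a central difference over a window surrounding the current frame. The function name `estimate_velocity`, the 10 Hz sampling rate and the coordinates below are hypothetical; they are not taken from the tracking data used in this chapter.

```r
# Central-difference estimate of velocity at frame i, using the frames
# half_window steps before and after the current frame.
estimate_velocity <- function(time, x, y, i, half_window = 1) {
  lo <- i - half_window
  hi <- i + half_window
  stopifnot(lo >= 1, hi <= length(time))
  dt <- time[hi] - time[lo]
  c(v_x = (x[hi] - x[lo]) / dt, v_y = (y[hi] - y[lo]) / dt)
}

# hypothetical tracking frames sampled at 10 Hz
time <- seq(0, 0.4, by = 0.1)
x <- c(0.0, 0.3, 0.6, 0.9, 1.2)
y <- c(5.0, 4.9, 4.8, 4.7, 4.6)

v <- estimate_velocity(time, x, y, i = 3)
print(v)
```

A wider window smooths measurement noise at the cost of responsiveness to genuine changes in velocity.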
contrast, the 356th player on the list is Jiri Pavlenka of Werder Bremen with a top speed
of 31.0 km/hour. For illustration in the remainder of the paper, we set the common value
smax = 9.2 metres/sec which corresponds to 33.1 km/hour. For comparison, we note that
Shaw (2020) uses s = 5.0 metres/sec, Fernández and Bornn (2018) use s = 13.0 metres/sec
and Brefeld, Lasek and Mair (2019) use smax = 8.0 metres/sec for maximum speed.
With the two-dimensional representation of velocity, we therefore introduce the constraint
$v_{x_t}^2 + v_{y_t}^2 \le s_{\max}^2$ for all times t.
[Figure: two histograms with counts on the vertical axes; left panel: maximum speed, right panel: maximum acceleration.]
Figure 5.3: The distribution of maximum speed and maximum acceleration of all players in
the Chinese Super League in 2019.
possible. There is support for the constancy of acceleration (at least for short time periods)
as displayed in nearly linear velocity curves for sprinters (Chatzilazaridis, Panoutsakopoulos,
and Papaiakovou 2012).
We now return to the calculation of the time t that it takes the player to travel from the
current location (x0 , y0 ) to the location of interest (x1 , y1 ). The first step is the determination
of the time t∗ that it takes the player to reach the prescribed maximum speed smax , where
t∗ is a function of the acceleration (ax , ay ) profile. Therefore, consider an acceleration vector
(ax , ay ) which lies on the circle $a_x^2 + a_y^2 = a^2$. Given the acceleration, maximum player speed
is achieved at time t∗ when
where negative solutions in time are non-sensical. It can be proven that there is a unique
solution t∗ > 0.
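
The calculation of t∗ can be checked numerically. The sketch below assumes that t∗ solves the quadratic $(v_{x_0} + a_x t)^2 + (v_{y_0} + a_y t)^2 = s_{\max}^2$, which is the condition implied by the function `compute_t_star` in Appendix A; the function name `time_to_max_speed` is hypothetical, and the input values are those of the Appendix A example.

```r
# Smallest positive root t* of |v0 + a t|^2 = s_max^2 (scalar inputs)
time_to_max_speed <- function(v_x0, v_y0, a_x, a_y, s_max = 9.2) {
  a_sq    <- a_x^2 + a_y^2
  v_dot_a <- v_x0 * a_x + v_y0 * a_y
  disc    <- v_dot_a^2 - a_sq * (v_x0^2 + v_y0^2 - s_max^2)
  roots   <- (-v_dot_a + c(-1, 1) * sqrt(disc)) / a_sq
  min(roots[roots > 0])
}

v_x0 <- -1.25; v_y0 <- 0.5
a_x <- -4.4694695; a_y <- 2.2413930

t_star <- time_to_max_speed(v_x0, v_y0, a_x, a_y)

# the speed at t* should equal s_max = 9.2 metres/sec
speed_at_t_star <- sqrt((v_x0 + a_x * t_star)^2 + (v_y0 + a_y * t_star)^2)
print(c(t_star = t_star, speed = speed_at_t_star))
```

When the current speed is below smax, the constant term of the quadratic is negative, so the two roots have opposite signs and exactly one of them is positive.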
Based on the stated motion assumptions, the location of the player at time t is therefore
given by
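\[
\begin{aligned}
x_t &=
\begin{cases}
x_0 + \int_0^{t} (v_{x_0} + a_x t)\, dt & t < t_* \\
x_0 + \int_0^{t_*} (v_{x_0} + a_x t)\, dt + \int_{t_*}^{t} (v_{x_0} + a_x t_*)\, dt & t > t_*
\end{cases} \\
&=
\begin{cases}
x_0 + v_{x_0} t + (1/2) a_x t^2 & t < t_* \\
x_0 + v_{x_0} t_* + (1/2) a_x t_*^2 + (t - t_*)(v_{x_0} + a_x t_*) & t > t_*
\end{cases}
\qquad (5.7)
\end{aligned}
\]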
and similarly,
\[
y_t =
\begin{cases}
y_0 + v_{y_0} t + (1/2) a_y t^2 & t < t_* \\
y_0 + v_{y_0} t_* + (1/2) a_y t_*^2 + (t - t_*)(v_{y_0} + a_y t_*) & t > t_*
\end{cases}
\qquad (5.8)
\]
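
The piecewise form of equations (5.7) and (5.8) can be written as a single R function that applies to either coordinate. This is a minimal sketch; the function name `position_1d` is hypothetical, and the example uses a stationary player accelerating at 5 metres/sec² along one axis, so that t∗ = 9.2/5 = 1.84 seconds.

```r
# Position along one coordinate at time t: constant acceleration a until t_star,
# then constant velocity v0 + a * t_star afterwards.
position_1d <- function(t, p0, v0, a, t_star) {
  ifelse(t < t_star,
         p0 + v0 * t + 0.5 * a * t^2,
         p0 + v0 * t_star + 0.5 * a * t_star^2 + (t - t_star) * (v0 + a * t_star))
}

t_star <- 9.2 / 5
pos <- position_1d(t = c(1, t_star, t_star + 1), p0 = 0, v0 = 0, a = 5,
                   t_star = t_star)
print(pos)
```

The two branches agree at t = t∗, so the trajectory is continuous, and after t∗ the player covers 9.2 metres in each additional second.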
Equations (5.7) and (5.8) present an algorithm for computing t by setting the location
of interest (x1 , y1 ) = (xt , yt ). The algorithm proceeds by stepping through acceleration
vectors (ax , ay ) according to the constraint $a_x^2 + a_y^2 = a^2$. For a given (ax , ay ), we determine
t∗ = t∗ (ax , ay ) according to equation (5.7). Having solved for t∗ , this simplifies equations
(5.7) and (5.8). For equation (5.7), we have the solution t(x) , and for equation (5.8), we
have the solution t(y) . If t(x) = t(y) , this means that for the acceleration vector (ax , ay ),
the player arrives at the coordinates x1 and y1 at the same time, and we have a solution
t = t(x) = t(y) . If there are multiple solutions for different values of (ax , ay ), then we select
the minimum time according to the assumption that players go to the location of interest
with intent.
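
The search over acceleration vectors can be approximated by brute force. The R sketch below is not the root-matching procedure described above: instead of solving for t(x) and t(y) and requiring equality, it evaluates the motion model on a grid of times for each acceleration direction and returns the earliest time at which the player is within a small distance of the target. The function name `min_time_to_reach`, the acceleration magnitude a = 5 metres/sec² (the magnitude of the vector in the Appendix A example), the time grid and the tolerance are all assumptions made for illustration, and the current speed is assumed to be below smax.

```r
min_time_to_reach <- function(x0, y0, x1, y1, v_x0, v_y0,
                              a = 5, s_max = 9.2, n_dir = 2000,
                              t_grid = seq(0, 12, by = 0.02), tol = 0.1) {
  # acceleration vectors on the circle a_x^2 + a_y^2 = a^2
  theta <- 2 * pi * (seq_len(n_dir) - 1) / n_dir
  a_x <- a * cos(theta)
  a_y <- a * sin(theta)

  # time t* to reach maximum speed, one value per direction
  v_dot_a <- v_x0 * a_x + v_y0 * a_y
  t_star <- (-v_dot_a + sqrt(v_dot_a^2 - a^2 * (v_x0^2 + v_y0^2 - s_max^2))) / a^2

  # rows = directions, columns = candidate times
  n_t <- length(t_grid)
  t_mat    <- matrix(t_grid, nrow = n_dir, ncol = n_t, byrow = TRUE)
  t_acc    <- pmin(t_mat, matrix(t_star, nrow = n_dir, ncol = n_t))
  t_cruise <- t_mat - t_acc   # time spent at maximum speed

  # positions according to the piecewise motion model
  x_t <- x0 + v_x0 * t_acc + 0.5 * a_x * t_acc^2 + t_cruise * (v_x0 + a_x * t_star)
  y_t <- y0 + v_y0 * t_acc + 0.5 * a_y * t_acc^2 + t_cruise * (v_y0 + a_y * t_star)

  # earliest grid time at which some direction is within tol of the target
  reached <- (x_t - x1)^2 + (y_t - y1)^2 < tol^2
  hit <- which(colSums(reached) > 0)
  if (length(hit) == 0) NA_real_ else t_grid[hit[1]]
}

# target 10 metres away for a stationary player, a player already moving
# towards the target and a player moving away from it
t_still  <- min_time_to_reach(0, 0, 10, 0, v_x0 = 0,  v_y0 = 0)
t_toward <- min_time_to_reach(0, 0, 10, 0, v_x0 = 5,  v_y0 = 0)
t_away   <- min_time_to_reach(0, 0, 10, 0, v_x0 = -5, v_y0 = 0)
print(c(t_still, t_toward, t_away))
```

The returned times are only accurate to the resolution of the time grid and the distance tolerance. The ordering of the three results reflects the role of initial velocity: the player moving towards the target arrives first and the player who must reverse direction arrives last.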
To operationalize the algorithm, suppose that for a given (ax , ay ), we determine the
unique solution t∗ according to equation (5.7). Then, if t < t∗ and ax ̸= 0, we solve a
quadratic equation and obtain
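\[
t^{(x)} = \frac{-v_{x_0} \pm \sqrt{v_{x_0}^2 - 2 a_x (x_0 - x_1)}}{a_x}. \qquad (5.9)
\]
If t < t∗ and ax = 0, the motion in the x-coordinate is at constant velocity and
\[
t^{(x)} = \frac{x_1 - x_0}{v_{x_0}}. \qquad (5.10)
\]
Finally, if t > t∗, we obtain
\[
t^{(x)} = \frac{x_1 - x_0 + (1/2) a_x t_*^2}{v_{x_0} + a_x t_*}. \qquad (5.11)
\]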
Analogous equations to (5.9), (5.10) and (5.11) are available for t(y) . Non-sensical solutions
t(x) and t(y) imply that it is not possible for the player to reach the location of interest
(x1 , y1 ) under the given acceleration vector (ax , ay ).
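
A small worked example may help to show how equations (5.9) and (5.11) interact. The R sketch below considers a hypothetical stationary player who accelerates at 5 metres/sec² along the x-axis towards a location 10 metres away, so that t∗ = 9.2/5 = 1.84 seconds; all numerical values are illustrative assumptions.

```r
x0 <- 0; x1 <- 10      # current location and location of interest (metres)
v_x0 <- 0; a_x <- 5    # initial velocity and acceleration in the x-direction
t_star <- 9.2 / a_x    # time to reach maximum speed (1.84 seconds)

# candidate from (5.9), which presumes the player is still accelerating
t_pre <- (-v_x0 + sqrt(v_x0^2 - 2 * a_x * (x0 - x1))) / a_x

# candidate from (5.11), which presumes maximum speed has been reached
t_post <- (x1 - x0 + 0.5 * a_x * t_star^2) / (v_x0 + a_x * t_star)

# (5.9) is only valid if its solution occurs before t_star
t_x <- if (t_pre < t_star) t_pre else t_post
print(c(t_pre = t_pre, t_post = t_post, t_x = t_x))
```

Here the solution of (5.9) is 2 seconds, which exceeds t∗ and therefore ignores the speed limit, so the slightly longer time from (5.11) applies.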
5.4 An Example
In the development of our pitch control metric in Section 5.3, we emphasized that control
of the pitch needs to be unambiguous. That is, a player on the team in possession must
be able to reach a location before a player on the opposing team, and the ball must be
delivered in a timely fashion. Consequently, our regions of dominance are typically smaller
than alternative pitch control diagrams.
Using the tracking data, we computed the current velocity vector (vx0 , vy0 ) for the same
example as displayed in Figure 5.2. The arrows indicating velocity are depicted in Figure
5.4. We observe that some of the velocities are large and some are small; this together with
the relative positioning of players determines the resultant pitch control diagram.
[Figure: pitch diagram showing the numbered players of the road and home teams with arrows indicating their velocities.]
Figure 5.4: Current velocity vectors for the example depicted in Figure 5.2.
In Section 5.3.3, we presented a motion model that derived the time t that it takes
a player to reach the location (x1 , y1 ) on the pitch given the initial location (x0 , y0 ) and
given initial velocity (vx0 , vy0 ). In Figure 5.5, the time is presented for both a stationary
player (left plot) and a player with a northwest velocity (right plot). In the left plot, we
observe colors that radiate in circles such that the player can reach any location of constant
distance in the same amount of time. This corresponds to the Voronoi tessellations. We
observe non-circular color contours in the right plot where the player can reach positions in
the northwest quicker than in other directions of similar distance. The right plot introduces
the reality of players having initial velocities that impact the time that it takes to reach
various locations.
[Figure: two pitch heat maps; colour scale in both panels: time to reach in sec, ranging from 0 to 12.]
Figure 5.5: The left plot uses colors to depict the time that it takes a stationary player
to reach field locations given the current location marked with a dot. The right plot does
likewise but introduces an initial velocity (arrow) for the player.
To determine pitch control regions, we discretize the soccer field (105 × 68 metres) into
1-by-1 metre grid cells and compute the time taken to reach the centre of each cell for each
player. We have set the tuning parameter ϵ = 0.5 seconds as described in Section 5.3.1
which requires that players arrive at least 1/2 second earlier than opponents to achieve pitch
control. The time variables tb , th and tr are computed according to the methods described
in Sections 5.3.2 and 5.3.3, and Table 5.1 is used to determine the pitch control
regions. Figure 5.6 provides the resultant pitch control map. When comparing Figure 5.2
(Voronoi) with Figure 5.6 (proposed approach), we observe considerable differences. First,
Figure 5.6 has grey ambiguous areas where neither team has pitch control whereas the
Voronoi diagram does not. This is sensible as there are locations which players on both
teams can reach before the ball, and therefore neither team can lay claim to the location.
There are also grey locations which a player on one team cannot reach ϵ = 0.5 seconds in
advance of the opponent. If we look closely at Figure 5.6, we see that initial velocities play
an important role in the determination of the pitch control diagram. For example, there is
a location immediately southwest of player #21 on the road team. Yet, player #25 on the
home team can reach this location quicker even though he is further away. The reason is
that player #25 is moving towards the location whereas player #21 is moving away and
needs to reverse direction.
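
The discretization of the field can be sketched in R as follows. The coordinate origin at a corner of the pitch and the passer location are assumptions made for illustration (the tracking data may use a different origin), and the ball speed of 18 metres/sec is the value from Section 5.3.2.

```r
cell_size <- 1   # side length of a grid cell in metres

# centres of the grid cells on a 105 x 68 metre field
grid <- expand.grid(
  x = seq(cell_size / 2, 105 - cell_size / 2, by = cell_size),
  y = seq(cell_size / 2, 68 - cell_size / 2, by = cell_size)
)

# time for the ball to reach each cell centre from a hypothetical passer
passer_x <- 30
passer_y <- 34
grid$t_b <- sqrt((grid$x - passer_x)^2 + (grid$y - passer_y)^2) / 18

n_cells <- nrow(grid)
print(n_cells)
print(head(grid, 3))
```

The player arrival times th and tr would be added to the same data frame, one value per cell, before the rules of Table 5.1 are applied.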
[Figure: pitch control map with regions coloured according to road team, home team or neither team, with numbered players.]
Figure 5.6: Pitch control diagram using the proposed methods for the example depicted in
Figure 5.2.
5.4.1 Computation
It takes approximately 0.2 seconds on a laptop computer to evaluate a pitch control decision
for a given location on the field. Whereas this seems reasonable, pitch control applications
become computationally intensive as there are typically many locations of interest and many
temporal-spatial snapshots of interest involving the initial locations and velocities of the 22
players. Fortunately, parallel processing may be implemented for the repeated tasks. For a
given target location on the field, we perform a grid search over 2000 combinations of ax and
ay subject to the constraint $a_x^2 + a_y^2 = a^2$ to find the pair (ax , ay ) that gives the shortest time
t to reach the target location. For a pitch control diagram, the computational time is heavily
affected by the number of grid cells specified for the soccer field. By doubling the side length
of each grid cell, we could reduce the computation time by a factor of four. More sophisticated optimization
algorithms for the evaluation of t could be considered in future work. Such algorithms may
avoid the consideration of unpromising acceleration pairs (ax , ay ). It may also be possible
to introduce computing efficiencies by eliminating some players in the determination of
minimum time t to reach a location. Players who are distant from the location of interest
cannot reach the location in minimum time.
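
The dependence of computing time on grid resolution can be made concrete with a rough estimate. The R sketch below assumes that the 0.2 seconds per location quoted above applies to every grid cell and that cells are processed serially; the function name `n_cells_for` is hypothetical.

```r
# Number of grid cells needed to cover a 105 x 68 metre field
n_cells_for <- function(cell_size, length_m = 105, width_m = 68) {
  ceiling(length_m / cell_size) * ceiling(width_m / cell_size)
}

sec_per_location <- 0.2
cell_size <- c(0.5, 1, 2)
cells <- sapply(cell_size, n_cells_for)

estimate <- data.frame(cell_size = cell_size,
                       cells = cells,
                       minutes = cells * sec_per_location / 60)
print(estimate)
```

Under these assumptions a single diagram on a 1-by-1 metre grid takes roughly 24 minutes without parallel processing, and the cost scales with the number of cells, that is, approximately with the inverse square of the cell side length.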
5.5 Accuracy of the Metric
Whereas the literature has introduced various approaches to pitch control, the literature is
sparse on validation. Validation is particularly difficult when color-codings are not
probabilistic. In our approach, we are able to investigate validation as the field is segmented
into three regions according to ownership by the two teams and neither team.
We sampled 10 games from the 2019 season of the CSL and obtained data on 7901
intended passes. We first classified the passes as either successful or intercepted. We then
calculated our pitch control metric for each of these passes and further classified the des-
tination location as either controlled by the intended team, the opponent or neither team.
The results of the two-way classification are given in Table 5.2.
An initial observation from Table 5.2 is that there are naturally more successful passes
(6826) than intercepted passes (1075). This corresponds to a successful pass rate of 86%. If
we omit the “neither team” designation, there are 5887 passes of which 5275+55 = 5330 are
controlled by the predicted team according to the pitch control model. This is suggestive
of a 91% accuracy rate in pitch control designation. However, we keep in mind that this
is actually a conservative figure. For example, there may be some passes that arrive at a
player’s feet and should be controlled. However, by some technical error on the part of
the player, the opponent gains control. When we look at the passes whose pitch control
designation is “neither team”, we observe that this corresponds to 25% of the passes (i.e.
1470 + 544 = 2014 passes out of 7901). These cases are truly more doubtful, as only 73% of
them are received as intended (i.e. 1470 passes out of 2014). We emphasize that our model
provides two tuning parameters that allow us to increase/decrease the number of passes
that are classified as “neither team”. By increasing the speed s of the ball (Section 5.3.2),
the ball will rarely lag the players to the location of interest, and consequently, we will
reduce the size of ambiguous regions according to Table 5.1. Also, increasing the time ϵ
that a player needs to arrive before the opponent (Section 5.3.1) will increase the size
of ambiguous regions.
Table 5.2: The classification of 7901 intended passes according to whether pitch control
(PC) was designated to the intended team, the opponent or neither team.
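
The summary rates discussed in this section can be recomputed from the counts quoted in the text. The R sketch below uses only those reported figures; the variable names are hypothetical.

```r
n_total       <- 7901   # intended passes
n_success     <- 6826   # successful passes
n_intercepted <- 1075   # intercepted passes

n_neither_success     <- 1470   # "neither team" passes received as intended
n_neither_intercepted <- 544    # "neither team" passes intercepted
n_neither    <- n_neither_success + n_neither_intercepted
n_designated <- n_total - n_neither   # passes with a team designation
n_correct    <- 5275 + 55             # outcome agrees with the designation

rates <- c(success_rate     = n_success / n_total,
           accuracy         = n_correct / n_designated,
           share_neither    = n_neither / n_total,
           neither_received = n_neither_success / n_neither)
print(round(100 * rates))
```

The accuracy figure is conditional on a team designation being made, while the last rate is the share of "neither team" passes that were nevertheless received as intended.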
5.6 Discussion
From the original work on pitch control established via Voronoi tessellations, there have
been various attempts to define field ownership. In this manuscript, we have provided a
motion model and heuristics (Table 5.1) that are straightforward but adhere (at least ap-
proximately) to the physics of running. A difference between our approach and most of the
methods proposed in the literature is that we define regions corresponding to the home
team, the road team and neither team. This allows us to validate the accuracy of pitch
control diagrams (Section 5.5) whereas this is not possible with approaches that provide
non-probabilistic color-codings. In addition, unlike some of the proprietary methods for
pitch control, we have provided code for determining pitch control regions.
Our pitch control model offers opportunities to enhance/modify the approach. For ex-
ample, it is possible to introduce player-specific maximum velocities and player-specific ac-
celerations. It is also possible to vary the ϵ parameter which dictates the additional amount
of time needed by a player to gain control over the opponent. It is also possible to vary
the speed of the ball. In particular, it may be reasonable to use historical data to associate
faster ball speeds with longer distances.
Bibliography
Albert, J.A., Glickman, M.E., Swartz, T.B. and Koning, R.H., Editors (2017). Handbook
of Statistical Methods and Analyses in Sports, Chapman & Hall/CRC Handbooks of
Modern Statistical Methods, Boca Raton.
Austin, P.C. (2011). An introduction to propensity score methods for reducing the effects of
confounding in observational studies. Multivariate Behavioral Research, 46, 399-424.
Bransen, L., Van Haaren, J. and van de Velden, M. (2019). Measuring soccer players’ con-
tributions to chance creation by valuing their passes. Journal of Quantitative Analysis
in Sports, 15(2), 97-116.
Brefeld, U., Lasek, J. and Mair, S. (2019). Probabilistic movement models and zones of
control. Machine Learning, 108, 127-147.
Buchheit, M., Samozino, P., Glynn, J.A., Michael, B.S., Al Haddad, H., Mendez-Villanueva,
A. and Morin, J.B. (2014). Mechanical determinants of acceleration and maximal
sprinting speed in highly trained young soccer players. Journal of Sports Sciences,
32(20), 1906-1913.
Bundesliga (2019). xG stats explained: The science behind Sportec Solutions’ expected
goals model. Accessed on July 10, 2020 at https://www.bundesliga.com/en/bundesliga/
news/expected-goals-xg-model-what-is-it-and-why-is-it-useful-sportec-solutions-3177
Caetano, F.G., Barbon Junior, S., Torres, R da S., Cunha, S.A., Ruffino, P.R.C., Martins,
L.E.B. and Moura, F.A. (2021). Football player dominant region determined by a novel
model based on instantaneous kinematics variables. Scientific Reports, 2(1), 1-10.
Cervone, D., D’Amour, A., Bornn, L. and Goldsberry, K. (2016). A multiresolution stochas-
tic process model for predicting basketball possession outcomes. Journal of the Amer-
ican Statistical Association, 111(514), 585-589.
Cheong, L.L., Zeng, X. and Tyagi, A. (2021). Prediction of defensive player trajectories in
NFL games with defender CNN-LSTM Model. 15-th MIT Sloan Sports Analytics Con-
ference, Accessed November 20, 2021 at https://global-uploads.webflow.com/5f1af76ed
86d6771ad48324b/607a44743e03939e1d87267a_LinLeeCheong-Defensive-PlayerNFL
-RPpaper.pdf
Coleman, J. (2018). Revealed: The 20 fastest players in the Premier League, including
Kyle Walker and Raheem Sterling. talkSPORT Football, accessed August 7, 2022 at
https://talksport.com/football/348058/20-fastest-players-premier-league-walker/
Decroos, T., Bransen, L., Van Haaren, J., and Davis, J. (2019). Actions speak louder
than goals: Valuing player actions in soccer. Proceedings of the 25th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining.
Dick, U. and Brefeld, U. (2019). Learning to rate player positioning in soccer. Big Data,
7, 71-82.
Fernández, J. (2022). A framework for the analytical and visual interpretation of complex
spatiotemporal dynamics in soccer. Department of Computer Science, Polytechnic
University of Catalonia. Accessed March 16, 2022 at https://upcommons.upc.edu/
handle/2117/363073
Fernández, J. and Bornn, L. (2018). Wide open spaces: A statistical technique for mea-
suring space creation in professional soccer. In 12th Sloan Sports Analytics Con-
ference, Accessed on May 14, 2020 at http://www.sloansportsconference.com/wp-
content/uploads/2018/03/1003.pdf
Fernández, J., Bornn, L. and Cervone, D. (2019). Decomposing the immeasurable sport:
A deep learning expected possession value framework for soccer. In 13th Sloan Sports
Analytics Conference, Accessed on May 14, 2020 at http://www.sloansportsconference
.com/wp-content/uploads/2019/02/Decomposing-the-Immeasurable-Sport.pdf
Fernández, J., Bornn, L. and Cervone, D. (2021). A framework for the fine-grained eval-
uation of the instantaneous expected value of soccer possessions. Machine Learning,
110(6), 1389-1427.
Ferrari Bravo, D., Impellizzeri, F.M., Rampinini, E., Castagna, C., Bishop, D. and Wisloff,
U. (2008). Sprint vs. interval training in football. International Journal of Sports
Medicine, 29(8), 668-674.
Fujimura, A. and Sugihara, K. (2005). Geometric analysis and quantitative evaluation of
sport teamwork. Systems and Computers in Japan 36(6), 49-58.
Goes, F.R., Meerhoff, L.A., Bueno, M.J.O., Rodrigues, D.M., Moura, F.A., Brink, M.S.,
Elferink-Gemser, M.T., Knobbe, A.J., Cunha, S.A., Torres, R.S. and Lemmink, K.A.P.M.
(2021). Unlocking the potential of big data to support tactical performance analysis in
professional soccer: A systematic review. European Journal of Sports Science, 21(4),
481-496.
Hodrick, R. and Prescott, E.C. (1997). Postwar U.S. business cycles: An empirical inves-
tigation. Journal of Money, Credit and Banking, 29(1), 1-16.
Imbens, G.W. (2004). Nonparametric estimation of average treatment effects under exo-
geneity: A review. The Review of Economics and Statistics, 86, 4-29.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T.-Y. (2017).
Lightgbm: A highly efficient gradient boosting decision tree. Advances in Neural In-
formation Processing Systems, 30, 3146–3154.
Kim, S. (2004). Voronoi analysis of a soccer game. Nonlinear Analysis: Modelling and
Control, 9(3), 233-240.
King, G. and Nielsen, R. (2019). Why propensity scores should not be used for matching.
Political Analysis, 27, 435-454.
Le, H.M., Carr, P., Yue, Y. and Lucey, P. (2016). Data-driven ghosting using deep imitation
learning. 10-th MIT Sloan Sports Analytics Conference, Accessed November 30, 2020
at https://global-uploads.webflow.com/5f1af76ed86d6771ad48324b/5fee0a8b98387922
27ec7fa5_Data-Driven%20Ghosting%20using%20Deep%20Imitation%20Learning.pdf
Le, H.M., Yue, Y., Carr, P. and Lucey, P. (2017). Coordinated multi-agent imitation learn-
ing. Proceedings of the 34th International Conference on Machine Learning, Sydney,
Australia.
Lewis, M. (2013). Moneyball: The Art of Winning an Unfair Game, WW Norton, New
York.
Link, D. and Hoernig, M. (2017). Individual ball possession in soccer, PLoS ONE, 12(7):
e0179953. https://doi.org/10.1371/journal.pone.0179953
Link, D., Lang, S. and Seidenschwarz, P. (2016). Real time quantification of dangerousity
in football using spatiotemporal tracking data. PLoS ONE, 11(12), 1-16.
Liu, H., Gomez, M.A., Lago-Penas, C. and Sampaio, J. (2015). Match statistics related
to winning in the group stage of 2014 Brazil FIFA World Cup. Journal of Sports
Sciences, 33(12), 1205-1213.
Llana, S., Madrero, P. and Fernández, J. (2020). The right place at the right time: Advanced
off-ball metrics for exploiting an opponent’s spatial weaknesses in soccer. 14-th MIT
Sloan Sports Analytics Conference, Accessed September 21, 2020 at https://global-
uploads.webflow.com/5f1af76ed86d6771ad48324b/5f6a69841d1ac99fa3a71a41
_Llana_The-right-place-at-the-right-time.pdf
Lowe, Z. (2013). Lights, cameras, revolution. Grantland, Accessed August 25, 2020 at
https://grantland.com/features/the-toronto-raptors-sportvu-cameras-nba-analytical-
revolution/
Mara, J., Morgan, S., Pumpa, K. and Thompson, K.G. (2017). The accuracy and reliability
of a new optical player tracking system for measuring displacement of soccer players.
International Journal of Computer Science in Sport, 16(3), 175-184.
Martens, F., Dick, U. and Brefeld, U. (2021). Space and control in soccer. Frontiers in
Sports and Active Living, 175.
Massard, T., Eggers, T. and Lovell, R. (2017). Peak speed determination in football: Is
sprint testing necessary? Science and Medicine in Football, 2(2), 1-4.
Memmert, D., Lemmink, K.A.P.M. and Sampaio, J. (2017). Current approaches to tactical
performance analyses in soccer using position data. Sports Medicine, 47(1), 1-10.
Miller, A., Bornn, L., Adams, R.P. and Goldsberry, K. (2014). Factorized point process
intensities: A spatial analysis of professional basketball. In Proceedings of the 31st
International Conference on Machine Learning - Volume 32, JMLR.org, Beijing, 235-
243.
Morgulev, E., Azar, O.H. and Lidor, R. (2018). Sports analytics and the big-data era.
International Journal of Data Science and Analytics, 5(4), 213-222.
Oberstone, J. (2009). Differentiating the top English Premier League football clubs from
the rest of the pack: Identifying the keys to success. Journal of Quantitative Analysis
in Sports, 5(3), Article 10.
Oliva-Lozano, J.M., Fortes, V., Krustrup, P. and Muyor J.M. (2020). Acceleration and
sprint profiles of professional male football players in relation to playing position.
PLOS ONE, 15(8): e0236959. https://doi.org/10.1371/journal.pone.0236959
Pearl, J. (2009). Causality: Models, Reasoning, and Inference, Second Edition, Cambridge
University Press: New York.
Peng, K., Clarke, D.C. and Swartz, T.B. (2022). Bayesian approaches for critical velocity
modelling of data from intermittent efforts. International Journal of Sports Science
and Coaching, 17(4), 868-879.
Pino-Ortega, J., Oliva-Lozano, J.M., Gantois, P., Nakamura, F.Y. and Rico-González,
M. (2022). Comparison of the validity and reliability of local positioning systems
against other tracking technologies in team sport: A systematic review. Proceedings of
the Institution of Mechanical Engineers, Part P: Journal of Sports Engineering and
Technology, 236(2), 73-82.
Power, P., Ruiz, H., Wei, X. and Lucey, P. (2017). Not all passes are created equal: Ob-
jectively measuring the risk and reward of passes in soccer from tracking data. In
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, Halifax, 1605-1613.
Reyers, M. and Swartz, T.B. (2021). Quarterback evaluation in the National Football
League using tracking data. AStA Advances in Statistical Analysis.
Sarkar, S. and Chakraborty, S. (2018). Pitch actions that distinguish high scoring teams:
Findings from five European football leagues in 2015-16. Journal of Sports Analytics,
4, 1-14.
Schuckers, M. and Curro, J. (2013). Total hockey rating (THoR): A comprehensive statis-
tical rating of National Hockey League forwards and defensemen based upon all on-ice
events. Proceeding of the 2013 MIT Sloan Sports Analytics Conference, Accessed on
February 26, 2019 at http://statsportsconsulting.com/thor/
Seidl, T., Cherukumudi, A., Hartnett, A., Carr, P. and Lucey, P. (2018). Bhostgusters:
Realtime interactive play sketching with synthesized NBA defenses. 12-th MIT Sloan
Sports Analytics Conference, Accessed August 25, 2020 at http://www.sloansports
conference.com/wp-content/uploads/2018/02/1006.pdf
Sekhon, J.S. (2011). Multivariate and propensity score matching software with automated
balance optimization: The matching package for R. Journal of Statistical Software,
42, 1-52.
Shaw, L. (2020). Advanced football analytics: Building and applying a pitch control model
in Python. Friends of Tracking, YouTube video accessed February 25, 2021 at https://www.
youtube.com/watch?v=5X1cSehLg6s
Shen, E., Santo, S. and Akande, O. (2022). Analyzing pace-of-play in soccer using spatio-
temporal event data. Journal of Sports Analytics, 8(2), 127-139.
Spearman, W. (2016). Quantifying pitch control. 2016 OptaPro Analytics Forum, DOI:
10.13140/RG.2.2.22551.93603
Spearman, W. (2018). Beyond expected goals. 12-th MIT Sloan Sports Analytics Con-
ference, Accessed September 21, 2020 at http://www.sloansportsconference.com/wp-
content/uploads/2018/02/2002.pdf
Spearman, W., Basye, A., Dick, G., Hotovy, R. and Pop, P. (2017). Physics-based model-
ing of pass probabilities in soccer. MIT Sloan Sports Analytics Conference, Accessed on
December 14, 2020 at https://www.researchgate.net/publication/315166647_Physics-
Based_Modeling_of_Pass_Probabilities_in_Soccer
Stöckl, M., Seidl, T., Marley, D. and Power, P. (2021). Making offensive play predictable
- Using a graph convolutional network to understand defensive performance in soc-
cer. 15-th MIT Sloan Sports Analytics Conference, Accessed November 20, 2021 at
https://global-uploads.webflow.com/5f1af76ed86d6771ad48324b/607a44a3c3d021c9cb
376186_PaulPower-OffensivePlaySoccer-RPpaper.pdf
Szczepanski, L. and McHale, I. (2016). Beyond completion rate: Evaluating the passing
ability of footballers. Journal of the Royal Statistical Society, 179, Part 2, 513-533.
Taki, T. and Hasegawa, J. (2000). Visualization of dominant region in team games and
its application to teamwork analysis. Proceedings of the International Conference on
Computer Graphics, 227-235.
Taki, T., Hasegawa, J. and Fukumura, T. (1996). Development of motion analysis system
for quantitative evaluation of teamwork in soccer games. Proceedings of 3rd IEEE
International Conference on Image Processing, Volume 3, 815-818.
Tan, J.H.Y., Polglaze, T. and Peeling, P. (2021). Validity and reliability of a player-tracking
device to identify movement orientation in team sports. International Journal of Per-
formance Analysis in Sports, 21(5), 790-803.
Torres-Ronda, L., Beanland, E., Whitehead, S., Sweeting, A. and Clubb, J. (2022). Track-
ing systems in team sports: A narrative review of applications of the data and sport
specific analyses. Sport Medicine Open, 8, Article 15.
Toumi, A. and Lopez, M. (2019). From grapes and prunes to apples and apples: Using
matched methods to estimate optimal zone entry decision-making in the National
Hockey League. Accessed on May 14, 2020 at https://rpubs.com/atoumi/zone-entries-
nhl
Vecer, J. (2014). Crossing in soccer has a strong negative impact on scoring: Evidence
from the English Premier League, the German Bundesliga and the World Cup 2014.
Accessed on February 5, 2019 at SSRN: https://ssrn.com/abstract=2225728
Vollman, R. with T. Awad and I. Fyffe (2016). Stat Shot: The Ultimate Guide to Hockey
Analytics, ECW Press: Toronto.
Voronoi, G. (1907). Nouvelles applications des paramètres continus à la théorie des formes
quadratiques. Premier Mémoire: Sur quelques propriétés des formes quadratiques
positives parfaites, Journal für die reine und angewandte Mathematik, 133, 97-108.
Wilson, J. (2013). Inverting the Pyramid: The History of Soccer Tactics, Nation Books:
New York.
Wu, L. and Swartz, T.B. (2022). A New Metric for Pitch Control based on an Intuitive
Motion Model. Manuscript under review.
Wu, L. and Swartz, T.B. (2022). Evaluation of off-the-ball actions in soccer. Manuscript
under review.
Wu, L. and Swartz, T.B. (2022). The calculation of player speed from tracking data.
International Journal of Sports Science & Coaching, 0(0).
Wu, L., Danielson, A., Hu, J.X. and Swartz, T.B. (2021). A contextual analysis of crossing
the ball in soccer. Journal of Quantitative Analysis in Sports, 17(1), 57-66.
Wu, Y., Xie, X., Wang, J., Deng, D., Liang, H., Zhang, H., Cheng, S. and Chen, W. (2019).
ForVizor: Visualizing spatio-temporal team formations in soccer, IEEE Transactions
on Visualization and Computer Graphics, 25(1), 65-75.
Yam, D.R. and Lopez, M.J. (2019). What was lost? A causal estimate of fourth down
behavior in the National Football League. Journal of Sports Analytics, 5, 153-167.
Yurko, R. and Pelechrinis, K. (2021). Evaluating defender ability to limit YAC. Accessed
November 20, 2021 at https://www.kaggle.com/ryurko21/evaluating-defender-ability-
to-limit-yac
Appendix A
Below is a sample of R code for Chapter 5 to compute the time to reach a given location.
#################################################################
#### Pitch Control
#################################################################

############################
## functions to compute time taken to reach a given location
############################

# Time t* at which a player with initial velocity (v_x_0, v_y_0) and constant
# acceleration (a_x, a_y) reaches the maximum speed s_max.
compute_t_star <- function(v_x_0, v_y_0, a_x, a_y, s_max = 9.2) {

  first_term <- -(v_x_0 * a_x + v_y_0 * a_y)
  second_term <- sqrt((v_x_0 * a_x + v_y_0 * a_y)^2 -
                        (a_x^2 + a_y^2) * (v_x_0^2 + v_y_0^2 - s_max^2))

  # only keep sol greater than 0
  sol1 <- pmax((first_term + second_term) / (a_x^2 + a_y^2), 0)
  sol1 <- ifelse(sol1 == 0, NA, sol1)

  # only keep sol greater than 0
  sol2 <- pmax((first_term - second_term) / (a_x^2 + a_y^2), 0)
  sol2 <- ifelse(sol2 == 0, NA, sol2)

  return(pmin(sol1, sol2, na.rm = TRUE))
}

# Time to reach x_1 from x_0 while still accelerating (t <= t*), i.e. the
# smallest non-negative root of x_1 = x_0 + v_x_0 * t + (1 / 2) * a_x * t^2.
compute_t_x_less_than_t_star <- function(x_0, x_1, v_x_0, a_x) {

  second_term_in_eq <- sqrt(v_x_0^2 - 2 * a_x * (x_0 - x_1))

  # special case when a_x = 0
  sol1 <- (-v_x_0 + second_term_in_eq) / a_x
  sol1 <- ifelse(abs(a_x - 0) < 0.000001, (x_1 - x_0) / v_x_0, sol1)
  # only keep sol greater than 0
  sol1 <- ifelse(sol1 < 0, NA, sol1)

  # special case when a_x = 0
  sol2 <- (-v_x_0 - second_term_in_eq) / a_x
  sol2 <- ifelse(abs(a_x - 0) < 0.000001, (x_1 - x_0) / v_x_0, sol2)
  # only keep sol greater than 0
  sol2 <- ifelse(sol2 < 0, NA, sol2)

  return(pmin(sol1, sol2, na.rm = TRUE))
}

# Time to reach x_1 from x_0 when the target is reached after t_star: constant
# acceleration up to t_star, then constant velocity v_x_0 + a_x * t_star.
compute_t_x_larger_than_t_star <- function(x_0, x_1, v_x_0, a_x, t_star) {

  t_x_sol <- (x_1 - x_0 + (1 / 2) * a_x * t_star^2) / (v_x_0 + a_x * t_star)
  # set time less than 0 to be NA
  t_x_sol <- ifelse(t_x_sol < 0, NA, t_x_sol)

  return(t_x_sol)
}

############################
# example of computing the time to reach a given location
############################

library(dplyr)

data.frame(start_x = -21,
           start_y = 0,
           end_x = -25,
           end_y = 0,
           vel_x = -1.25,
           vel_y = 0.5,
           accel_x = -4.4694695,
           accel_y = 2.2413930,
           s_max = 9.2) %>%
  dplyr::mutate(
    t_to_reach_max_speed = compute_t_star(v_x_0 = vel_x,
                                          v_y_0 = vel_y,
                                          a_x = accel_x,
                                          a_y = accel_y,
                                          s_max = s_max),
    t_x_less_than_t_star = compute_t_x_less_than_t_star(x_0 = start_x,
                                                        x_1 = end_x,
                                                        v_x_0 = vel_x,
                                                        a_x = accel_x),
    t_x_larger_than_t_star = compute_t_x_larger_than_t_star(x_0 = start_x,
                                                            x_1 = end_x,
                                                            v_x_0 = vel_x,
                                                            a_x = accel_x,
                                                            t_star = t_to_reach_max_speed),
    t_y_less_than_t_star = compute_t_x_less_than_t_star(x_0 = start_y,
                                                        x_1 = end_y,
                                                        v_x_0 = vel_y,
                                                        a_x = accel_y),
    t_y_larger_than_t_star = compute_t_x_larger_than_t_star(x_0 = start_y,
                                                            x_1 = end_y,
                                                            v_x_0 = vel_y,
                                                            a_x = accel_y,
                                                            t_star = t_to_reach_max_speed),
    # per axis: use the accelerating-phase time if the target is reached
    # before maximum speed, otherwise the time that includes the cruise phase
    t_x = ifelse(t_x_less_than_t_star <= t_to_reach_max_speed,
                 t_x_less_than_t_star,
                 t_x_larger_than_t_star),
    t_y = ifelse(t_y_less_than_t_star <= t_to_reach_max_speed,
                 t_y_less_than_t_star,
                 t_y_larger_than_t_star),
    # the location is reached once both coordinates have been reached
    t_sol = pmax(t_x, t_y)
  )
pitch_control.R
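The three helper functions in the listing above can be read as closed-form solutions of a simple two-phase motion model. The sketch below is reconstructed from the code alone. It assumes a player who starts at $x_0$ with velocity $(v_{x,0}, v_{y,0})$, accelerates at a constant $(a_x, a_y)$ until the speed reaches $s_{\max}$ at time $t^*$, and then moves at constant velocity. The symbols mirror the function arguments and may differ from the notation used in Chapter 5.

```latex
\begin{align*}
% compute_t_star: speed equals s_max at time t^*
(v_{x,0} + a_x t^*)^2 + (v_{y,0} + a_y t^*)^2 &= s_{\max}^2 \\
t^* &= \frac{-(v_{x,0} a_x + v_{y,0} a_y)
        \pm \sqrt{(v_{x,0} a_x + v_{y,0} a_y)^2
        - (a_x^2 + a_y^2)(v_{x,0}^2 + v_{y,0}^2 - s_{\max}^2)}}
        {a_x^2 + a_y^2} \\[6pt]
% compute_t_x_less_than_t_star: target reached while still accelerating
x_1 &= x_0 + v_{x,0}\, t + \tfrac{1}{2} a_x t^2, \qquad t \le t^* \\
t &= \frac{-v_{x,0} \pm \sqrt{v_{x,0}^2 - 2 a_x (x_0 - x_1)}}{a_x}
   \quad (a_x \ne 0), \qquad
t = \frac{x_1 - x_0}{v_{x,0}} \quad (a_x = 0) \\[6pt]
% compute_t_x_larger_than_t_star: accelerate until t^*, then constant velocity
x_1 &= x_0 + v_{x,0}\, t^* + \tfrac{1}{2} a_x t^{*2}
      + (v_{x,0} + a_x t^*)(t - t^*), \qquad t > t^* \\
t &= \frac{x_1 - x_0 + \tfrac{1}{2} a_x t^{*2}}{v_{x,0} + a_x t^*}
\end{align*}
```

In each case the code discards negative roots and keeps the smaller remaining one. The same per-axis formulas are reused for the y coordinate by passing start_y, end_y, vel_y and accel_y, and the reported time is the larger of the two per-axis times.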