
Numerical algorithms for predicting sports

results
Jack Blundell
Computer Science (with Industry)
2008/2009

The candidate confirms that the work submitted is their own and the appropriate credit has been
given where reference has been made to the work of others.

I understand that failure to attribute material which is obtained from another source may be considered
as plagiarism.

(Signature of student)
Summary
This project looks at how numerical data can be used to predict the outcome of sporting events. More
specifically, the project details specially created algorithms which make use of this data in order to
predict the outcome of American Football games.
The report details the critical analysis of these algorithms when compared to the actual match
results. These algorithms range from simplistic single-feature predictors to complex statistical models.
Furthermore, predictions made by the betting market are used to assess the accuracy of the project's
most accurate model. The report also includes a literature review describing previous numerical models
that have been used to predict the outcome of sporting events.

Acknowledgements
Firstly, I would like to thank my supervisor Dr. Katja Markert for all her time, help and support
throughout the project. Also, I want to acknowledge Dr. Andy Bulpitt for his important comments
within the marking of the mid-project report and the progress meeting.
Furthermore, I wish to thank fellow student Lee ‘Junior’ Tunnicliffe for proofreading my report,
which was very much appreciated. Lastly, I would like to ‘thank’ Kaj David for all his conversations
about Football Manager after I had sworn not to touch it this year.

Contents

1 Introduction
  1.1 Introduction
  1.2 Project Aim
  1.3 Objectives
  1.4 Minimum Requirements
  1.5 Deliverables
  1.6 Potential Extensions

2 Project Planning
  2.1 Methodology
  2.2 Original Schedule
  2.3 Revised Schedule
  2.4 Choice of Programming Language
  2.5 Project Evaluation
    2.5.1 Statistical Difference Tests
      2.5.1.1 McNemar Test
    2.5.2 Betting Line

3 Background Reading
  3.1 American Football
    3.1.1 The National Football League
    3.1.2 Rules
    3.1.3 Spread Betting
    3.1.4 Power Scores
  3.2 Text Mining
    3.2.1 Predictive Opinions
    3.2.2 Judgment Opinions
  3.3 Numerical Analysis
    3.3.1 Numerical Models for Predicting Sporting Results
      3.3.1.1 Models Within American Football
      3.3.1.2 Models Within Other Sports
      3.3.1.3 Expert Opinions Within Sports
    3.3.2 Regression Analysis
      3.3.2.1 Logistic Regression
      3.3.2.2 Maximum Likelihood Estimation
  3.4 Machine Learning Software
    3.4.1 WEKA
  3.5 Summary of Reading

4 Prototypes
  4.1 Data Collection
  4.2 Prototype 1 (HOME)
    4.2.1 Prototype Summary
  4.3 Prototype 2 (PREV RES)
    4.3.1 Design
    4.3.2 Implementation
    4.3.3 Evaluation
    4.3.4 Prototype Summary
  4.4 Prototype 3 (Goddard & Asimakopoulos Model)
    4.4.1 Design
    4.4.2 Implementation
      4.4.2.1 Feature Extraction
      4.4.2.2 Training and Testing Set Creation
      4.4.2.3 WEKA Vector Convector
      4.4.2.4 Data Analysis Using WEKA
    4.4.3 Evaluation
    4.4.4 Prototype Summary
  4.5 Prototype 4 (Inclusion of Ranking Features)
      4.5.0.1 Jeff Sagarin's Power Ratings
      4.5.0.2 Football Outsiders
    4.5.1 Design
    4.5.2 Implementation
    4.5.3 Evaluation
    4.5.4 Prototype Summary
  4.6 Prototype 5
    4.6.1 Implementation
    4.6.2 Evaluation
    4.6.3 Prototype Summary
  4.7 Evaluation Against Betting Market

5 Evaluation
  5.1 Quantitative Evaluation
    5.1.1 Overall Prototype Evaluation
    5.1.2 Usefulness of Features
      5.1.2.1 Feature Ablation
      5.1.2.2 Ranking Of Features
  5.2 Qualitative Evaluation
    5.2.1 Project Evaluation
    5.2.2 Objective and Minimum Requirements Evaluation
    5.2.3 Project Extensions
    5.2.4 Schedule Evaluation
    5.2.5 Methodology Evaluation

6 Conclusion
  6.1 Conclusion
  6.2 Further Work

Bibliography

A Personal Reflection
B Project Schedule
C PREV RES Algorithm
D Feature Ablation Results

Chapter 1

Introduction

1.1 Introduction
The interest in sport has reached phenomenal heights over recent years with the help of satellite television
and sports channels such as Sky Sports and Setanta. A Sky digital customer, for example, has access
to over 25 sports channels [3]; combined with the vast amount of sporting information on the Internet,
it has never been easier to become interested in professional sports. As a result, the art of (successfully)
predicting the outcome of sporting events has become more sought-after. There are multiple ways in
which fans can predict the outcomes of such events, for example using prediction websites (e.g.
I Know The Score1 ) that are solely for fun between friends and other fans. Alternatively, fans can make
bets with a bookmaker either on the high street or online. A glimpse of this sizeable gambling interest
can be seen with the English bookmaker Ladbrokes, which has over 2,000 shops in the UK alone and
in 2008 announced pre-tax profits of £344 million! [8]. Needless to say, a lot of money can be made if a
person can make accurate predictions about sporting events.
This project used numerical data within differing statistical models to achieve predictions for Amer-
ican Football matches. This data included historical information about the competitors and their recent
results as well as other novel information that was expected to help achieve the best predictor possible.

1.2 Project Aim


The aim of this project was to develop algorithms that used numerical data for predicting sport results.
In other words, it analysed how successful predictions can be by using numerical data alone, without
the use of subjective information such as opinions or contextual data. This investigation was carried out
within the domain of American Football.

1.3 Objectives
The objectives involved within this project include:

• To understand what information can help predict the outcome of an American Football match.
• To understand the different ways in which this information can be used to model a match.

1 http://iknowthescore.premierleague.com - Allows soccer fans to guess the score of English Premier League matches

• To create a model in order to predict the outcomes of American Football matches.
• To discover how successful this model can be by relying on numerical data alone.

1.4 Minimum Requirements


The minimum requirements of this project were:

• Development and implementation of existing sports prediction algorithms to apply to American Football.
• Development and implementation of enhancements to existing algorithms: integration of novel
features.
• Feature ablation studies to identify the most useful existing and novel features.
• Critical analysis of existing and enhanced numerical algorithms by comparison to actual match
results.

1.5 Deliverables
One of the deliverables produced by the project was a statistical model that used numerical data to
predict the outcome of American Football matches. The other deliverable was this detailed report of the
execution and findings of said model.

1.6 Potential Extensions


There were a number of potential enhancements identified from the outset that could be applied to the
project:

• To include subjective opinions from professional experts in predicting the outcome of a match.
• To compare the final model's predictions with those of the betting market.
• To see how betting patterns could be analysed and used to increase the prediction accuracy.
• To see how successful the model would be using data from a sporting domain other than American
Football (e.g. ice hockey).

Chapter 2

Project Planning

Originally, the project was going to utilise not only numerical data, but also use NLP (Natural Language
Processing) to analyse the opinions of professional experts to achieve an accurate prediction. This was
going to extend work carried out by McKinlay in which he used textual analysis to extract predictions
made by American Football fans within Internet forums. However, he found that predictions made by
individual fans were laced with bias. In short, he concluded that fans were poor predictors of sporting
events [39]. In light of this, I intended to assess predictions made by professional experts and then
incorporate the numerical data to improve those predictions.
By Christmas, I had carried out background research on textual mining (and some research on
numerical prediction models). Unfortunately, after prolonged searching, I was unable to find the expert
data needed to proceed with the original project. Therefore, I decided that the numerical side of the
project was interesting and detailed enough to be concentrated on solely.

2.1 Methodology
The methodology for the project (in conjunction with the project schedule) was hugely important to
ensure structure was kept, deadlines met and also to see that the direction of the project was maintained.
After research was carried out to see which type of methodology would suit the current project, it was
clear that a prototype approach was appropriate. This was due to the problem being more akin to a
research investigation than a software engineering based project. This project’s prototype approach
was therefore centred around using numerical techniques (based on the research discovered) to predict the
outcome of a group of matches, then analysing how each technique performed and identifying where its
weaknesses lay, with a view to rectifying them in the next prototype.
Hughes and Cotterell state that “prototypes are working models of one or more aspects of the pro-
jected system that are constructed and tested quickly in order to evaluate assumptions” [33]. Hence,
this ability with prototyping to try ideas out and evaluate them quickly and at little cost to the project
benefited this investigation immensely. Alternative models like waterfall and V-process do not allow for
such iteration thus were not appropriate. It is also claimed that using a prototype is preferential when
there is a need to learn about an area of uncertainty from the developer’s point of view [33]. This was
clearly advantageous within this project as I had little previous experience with developing predictive

algorithms.
Prototypes can be split-up into two categories, throw-away and evolutionary. Throw-away indicates
that ‘test’ systems are developed and thrown away with a view to developing the true system [33].
Evolutionary describes when a prototype is developed and modified until it is finally at the stage where
it can become the proposed solution [33]. An evolutionary technique would be advantageous if a new
prototype was simply the previous prototype but with features either added or removed. However, some
prototypes held no predictive qualities at all and thus were discarded from the next iteration. Therefore
a combination of the two was seen to be the best approach.
Having said this, there are risks associated with prototyping. These risks include the possibility
of poor coding standards being adopted within the project, with the developer being more inclined to use
programming ‘hacks’ [33]. These would hinder the program’s consistency and flexibility and were therefore
avoided. Furthermore, the code that was created for analysing one prototype was kept non-specific
and flexible so that it could be reused in future prototypes (thus saving time).
According to [33], to be considered a genuine prototype approach, the following must be executed:

• Specify what is hoped to be learnt from the prototype.


• Plan how the prototype is to be evaluated.
• Report on what has actually been learned.

Therefore it is my aim to clearly carry out these points within this project report.

2.2 Original Schedule


Figure B.1 indicates the initial project schedule. The first thing the reader will note is the grey block
within December and January. This represents my exam period which I decided to concentrate exclu-
sively on and thus needed to manage my project so that it did not suffer because of this predetermined
neglect. The basis of the original plan was that all of the background reading be carried out within
the first term, the majority of which would cover the textual analysis aspect of the project with a
smaller amount of numerical prediction theory. This would then be used to formulate the design of a
textual-based algorithm after Christmas.
In terms of the numerical prototypes, the first two would be simple baseline algorithms that required
no background reading in terms of statistical analysis. These would rely on common patterns associated
with sporting outcomes (e.g. choose the home team to win). Then, numerical prototype 3’s design
would depend on initial research carried out on complex predictive models along with the findings from

the first two prototypes. The implementation of this prototype would start before Christmas and be
finished after the exam period. After assessing both numerical prototype 3 and the textual prototype,
I would then combine the numerical and textual prototypes into one prototype which would then be
evaluated. The best prototype from all of the work previously carried out would then be compared to
the betting line.

2.3 Revised Schedule


In Figure B.2, we see that the original plan was changed. Clearly, the textual prototype was discarded,
however the background reading is still present as this had already been carried out. Also, as my project
had become solely focused on the numerical algorithms, this meant more time was needed to carry out
further numerical research.
The stage ‘Review of first three prototypes’ represents me looking at how these numerical prototypes
had fared and how to proceed from that point. Moreover, as I had no idea how successful these
prototypes would be or what their weaknesses were, the number of further prototypes required
was unknown. Therefore, a period of a month was allocated to build upon the first three iterations with-
out a predefined final number of prototypes in mind. This would depend on how successful each was
and what knowledge had been gathered within the background reading. Clearly, by setting no upper
or lower limit on the number of prototypes needed, there was potential for a certain lack of discipline.
However, as I was not aware of how successful each prototype would be, this was essential. The impor-
tant decision was to allow enough time in which to develop these future prototypes. The final prototype
would be tested at the end of March against betting odds to reach an understanding of how accurate it
really was.
One other point to note is the large period devoted to ‘Write up and Reflection’. Due to the lengthy
process I found when writing the mid-project report, I decided to stretch the process out whilst doing
other parts of the project so it would not be rushed within the space of a couple of weeks1 .

2.4 Choice of Programming Language


There were many languages considered to help implement this project, one of which was Python, a
powerful dynamic programming language that is used in a wide variety of application domains [12]. It

1 Clearly this process was very selective as some parts of the report could not have been written if they had not been carried
out at that point

has clear, readable syntax and does not restrict the user to Object-Orientated programming. For these
reasons, it is recognised as being very efficient as generally only a relatively small number of lines are
needed to write simple programs. This is especially true when ‘spidering’2 [39], however this was not
seen to play a large part within the project.
Java was another language that was considered. Java is a programming language originally devel-
oped by Sun Microsystems which relies on the Object-Orientated paradigm and is used by 6.5 million
developers worldwide [9]. My knowledge of Python is far from extensive and therefore understanding
the language would become another aspect of the project. The learning curve would have benefited my
programming experience in the long run. However, as the project was of significant importance, I
decided to use Java: I am far more accomplished with this language, and overcoming
obstacles within the project was expected to be easier if I already had knowledge of the language used.

2.5 Project Evaluation


The prediction accuracy was used to assess the success of each prototype. This was the number of
correctly predicted games divided by the number of predicted games (%) within the specified test set:

Prototype accuracy = (Number of correctly predicted matches / Total number of predicted matches) × 100

2.5.1 Statistical Difference Tests

Statistical significance assesses how likely it is that a result would be obtained if the null hypothesis were true
(where the null hypothesis states that any result is attained due to chance) [42]. Thus, a measure was needed to
judge how likely it was that the difference in two prototype accuracies was obtained through chance.
Dietterich put forward the use of the McNemar test whilst analysing five such measures [21]. When
comparing these tests3 , he found that as other tests required the training and test sets to undergo forms
of cross-validation, the McNemar test was ideal when dealing with expensive algorithms [21]. It was
hard to say how time-consuming my algorithms would be but I envisaged using text file parsing which
can be computationally expensive. Not only this but Dietterich found that the McNemar test was one of
the most reliable significance measures within the set of five. This was shown when ‘Type I’ error tests
were undertaken. A Type I error is when a statistical significance measure declares two algorithms as

2 Browsing the Web for data in an automated manner


3 The 5 measures assessed were: the test for difference of two proportions, two paired t tests (one based on random train/test
splits, one based on 10-fold cross-validation), McNemar and the 5x2cv test

being different when they are in fact similar. The McNemar test was found to have a consistently low
Type I error rate in these tests [21]. In consideration of this, I decided to use the McNemar test.

2.5.1.1 McNemar Test

The McNemar test makes use of matched binary pairs taken from the output of two distinct predictive
models (A and B) where each pair relates to correct and incorrect forecasts [24]. These pairs are then
compared with each other and placed into four categories:
• A was correct, B was correct (a)
• A was correct, B was incorrect (b)
• A was incorrect, B was correct (c)
• A was incorrect, B was incorrect (d)
The totals of the two discordant results (i.e. where A and B achieved different results) are then placed
into the following McNemar formula [24]:

χ² = (b − c)² / (b + c)
This χ² value represents the McNemar statistic which, when referred to a table of the χ² distribution [35],
will reveal the level of significance between the two models. This ‘p’ value extracted from the table thus
represents how likely it is that the difference in accuracy between two models was achieved through
chance. For the purpose of this project, we shall generally look to using the p < 0.05 level of significance
(there is less than 5% chance the differences are down to chance). If a significance value below this 0.05
threshold is achieved then we can reject the null hypothesis.
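To make this procedure concrete, the following is a minimal Java sketch (a hypothetical helper written purely for illustration, not part of the project's delivered code). It counts the discordant pairs b and c from the matched outputs of two predictors over the same matches, computes the statistic above, and compares it against 3.841, the χ² critical value for one degree of freedom at the p < 0.05 level.

    public class McNemarTest {

        // Chi-squared statistic for the two discordant counts b and c.
        static double mcNemar(int b, int c) {
            if (b + c == 0) {
                return 0.0; // the two models never disagreed
            }
            double diff = b - c;
            return (diff * diff) / (b + c);
        }

        public static void main(String[] args) {
            // correctA[i] / correctB[i]: whether model A / B predicted match i correctly
            boolean[] correctA = {true, true, false, true, false, true, true, false};
            boolean[] correctB = {true, false, false, true, true, false, true, false};

            int b = 0, c = 0;
            for (int i = 0; i < correctA.length; i++) {
                if (correctA[i] && !correctB[i]) b++;   // A correct, B incorrect
                if (!correctA[i] && correctB[i]) c++;   // A incorrect, B correct
            }

            double chiSquared = mcNemar(b, c);
            // 3.841 is the chi-squared critical value for 1 degree of freedom at p < 0.05
            boolean significant = chiSquared > 3.841;
            System.out.printf("chi-squared = %.3f, significant at p < 0.05: %b%n",
                              chiSquared, significant);
        }
    }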

2.5.2 Betting Line

The final prototype was to be compared against the bookmakers’ predictions. As discussed at further length within
my background reading, I found that the bookmakers more often than not achieve the best forecasting results
when it comes to predicting sporting events. Therefore when my final prototype was completed, it was
planned to compare its forecasts with that of the bookmakers’. Hence, this betting data needed to be
obtained.
During my background reading I came across a paper [29] where this kind of data was used, so
I contacted one of the authors. The author, Prof. Philip K. Gray sent me the betting spreads which
accounted for over 3,000 NFL matches played between 1976 and 19944 .

4 Spread data was not available for every game within this time period

Chapter 3

Background Reading

The background research carried out within this project is dissected into three areas. The first section
explains why American Football was chosen as the sporting domain as well as some information on the
sport. Section 3.2 describes researched work within the field of textual analysis which would have been
used to analyse expert predictions. Lastly, Section 3.3 covers work analysing how numerical statistics
have been used to predict the outcomes of sporting events.

3.1 American Football


An important decision had to be made regarding the sport to be used within this project. This sport
needed to adhere to requirements that would ensure that the project had a fair chance of being imple-
mented successfully. These requirements included:

• Extensive history of results - Ensures that the sport has been played for enough years so that
data would reach back far enough to allow for extensive analysis.
• Home advantage - It would be useful to assess how home advantage affects the outcome of a
match. Therefore, every event had to have a ‘home’ competitor associated with it.
• Low frequency of ties - Predicting the outcome to be a draw would have been difficult, thus a
sport where tied games are rare was preferable.
• Regular seasonal fixtures - If the sport involved some sort of league structure with games being
played every year during set months then the extraction and analysis of data within the project
would be benefited.

This led to American Football being considered. The sport fulfils all of the requirements above and
therefore I decided that it would be a suitable choice. I investigated the scoring system of the sport as
this knowledge was required when considering the potential results a game might have. Furthermore, a
quick look into the match and league structure could aid the design of the algorithms.

3.1.1 The National Football League

The NFL is the professional league of American Football. It currently has 32 teams which are split up
into two 16-team conferences which are in turn split up into four 4-team divisions. Each team plays a
16-game season encompassing the majority of teams in their division but a team can also play teams

outside their division or conference1 . The winners in each of the 8 divisions and the 2 best runners-up
from each conference proceed to a knock-out tournament known as the Playoffs which culminates in
the final championship game (the Super Bowl) [11]. The NFL is an American institution and is known
worldwide. To grasp just how popular it is, a worldwide television audience of 148.3 million tuned in
to watch the 2008 Super Bowl between the New York Giants and the New England Patriots [14].

3.1.2 Rules

Each team has 11 players and at any one time, one side’s group of 11 will be the offensive team (i.e.
the team in possession of the football) and the opponents use their defensive players. The aim of the
offensive team is to get into the area past the goal line (known as the ‘end zone’) with the ball. They
can do this through what are known as ‘plays’ which can involve either running with the ball or passing
the ball by throwing it (or a combination of the two). The play stops when the defensive team tackle the
player with the ball or the ball goes out of play [10].
The offensive team start with a ‘first down’ and have four plays to advance the ball 10 yards up the
pitch from where the ‘first down’ began. If they fail then the roles of the teams are reversed: the
offensive team become the defensive team (and vice versa). If they do gain the 10 yards then they
reach another ‘first down’ and have four more plays to achieve another 10 yards, and so on.
Each match is split up into four 15 minute quarters. If the teams are level after this initial 60 minutes,
then a 15 minute overtime period is played where the first team to score wins. If the teams are still level
then the game is drawn, but as the point scoring system suggests, draws are very rare. Points can be
scored in the following way: [10]
Touchdown - Worth 6 points, and is achieved by carrying (or receiving) the ball into the end zone.
Conversion - After a touchdown, the scoring team can either kick a conversion for 1 point, or attempt a
more complex 2 point passed conversion.
Field Goal - For 3 points, a team can kick the ball through the posts and over the crossbar.
Safety - 2 points can be scored by tackling an offensive player in his own end zone.

1 This assignment of games is complex and will not be recognised in this project as the prediction algorithms will not take
into account a team’s division or conference

3.1.3 Spread Betting

Spread betting is a form of gambling in American Football whereby the bookmaker specifies the margin
by which he thinks the ‘favourite’ team will win. It is then up to the bettor whether he thinks this
favoured team will win by more than the specified margin, in which case he bets for the favourite.
Otherwise, if the bettor thinks the ‘underdog’ will win (or fail to lose by that margin) then he will bet for
the team viewed as the underdog. If the favourites win by the exact margin the bookmaker specified,
the bet is regarded as a ‘push’ [46] and the stake is returned. For example, the bookmaker putting a 7+
spread on Atlanta over Dallas signifies Atlanta are the favourites to win by more than 7 points. Imagine
the bettor places a bet on Atlanta, then if:
• Atlanta win by more than 7 points - The bettor wins the bet.
• Atlanta win by less than 7 points - The bettor loses the bet.
• Atlanta win by 7 points exactly - A push occurs and the bettor gets his stake returned.
• Dallas win or the match is drawn - The bettor loses the bet.

This form of betting incurs a weighted advantage towards the bookmakers in the form of commission.
This means that for a gambler to achieve a positive return on their bets, they must have an average
prediction accuracy of 52.4% [29].
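As an illustration, the minimal Java sketch below (class and method names are my own and purely illustrative) settles a bet placed on the favourite according to the rules above, and shows where the 52.4% figure comes from when the usual 10/11 (‘risk 11 units to win 10’) commission is assumed.

    public class SpreadBet {

        enum Outcome { WIN, LOSE, PUSH }

        // Settles a bet placed on the favourite, given the margin the favourite
        // actually won by (negative if they lost) and the bookmaker's spread.
        static Outcome settleBetOnFavourite(int favouriteMargin, int spread) {
            if (favouriteMargin > spread)  return Outcome.WIN;   // covered the spread
            if (favouriteMargin == spread) return Outcome.PUSH;  // exact margin: stake returned
            return Outcome.LOSE;                                 // failed to cover (or lost/drew)
        }

        public static void main(String[] args) {
            // The Atlanta (7+ over Dallas) example from above:
            System.out.println(settleBetOnFavourite(10, 7)); // WIN
            System.out.println(settleBetOnFavourite(3, 7));  // LOSE
            System.out.println(settleBetOnFavourite(7, 7));  // PUSH

            // With the usual 10/11 commission (risk 11 units to win 10), the
            // break-even accuracy is 11 / 21, i.e. roughly the 52.4% quoted above.
            System.out.printf("Break-even accuracy: %.1f%%%n", 100.0 * 11 / 21);
        }
    }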

3.1.4 Power Scores

Power scores are a popular way within the media to represent a football team’s current strength. These
are carried out by various newspapers and media sources, each using their own statistical methods.
Although fairly secretive, these rankings are assumed to include information such as previous results,
the strength of the opponents played and overall defensive/offensive capabilities [18, 44]. Every week,
each team is assigned a value representative of their current overall quality, these values are then used
to sort the teams into ‘power rankings’. The rankings can then be used by gamblers to assess whether
one team will be victorious over another in an upcoming game.
These power rankings are held in high regard within the footballing community. As there is no
NFL-style Playoff system in college football, power rankings are used to determine which college teams
performed better throughout the season in order to compile a final table. These rankings are compiled
using the most accurate ranking systems from the media. Among the systems that are used is one created
by Jeff Sagarin who publishes his figures within the national newspaper USA Today [18]. Clearly, such
data could be used to predict the outcome of an American Football game. Thus, after some initial

research I found that Jeff Sagarin’s archived team ratings are available from 1998 onwards [16].

3.2 Text Mining


Text mining is the process of analysing text to extract information for a particular purpose [48] and
research was carried out into how this can be done. Generally speaking, text can take the form of fact or
opinion and work was performed to see if the two could be separated. Yu and Hatzivassiloglou used a
Naive Bayes classifier to determine whether a document is factual (e.g. news story) or opinionated (e.g.
editorial) and found that it was harder to classify sentences than it was to identify documents in this way
[49]. The main focus of my research was opinion-based as they form the basis of expert predictions.
Kim and Hovy define two types of opinions, predictive and judgmental. Predictive opinions express
a person’s opinion about the future of an event whereas judgment opinions express positive or negative
attitude towards a topic [36]. A predictive opinion would be “I think that Miami will lose tomorrow”
and an example of a judgmental opinion is “I think Miami Dolphins are awful”. Furthermore, each
predictive and judgmental opinion can have sentiment attached to it. The sentiment of a sentence is
the feeling (positive, negative or neutral) which is implicitly expressed within it [36]. Sentiment can also be
referred to as polarity, where the polarity of a sentence is either positive or negative [49].

3.2.1 Predictive Opinions

As the project was originally going to analyse expert predictions, the analysis of these predictive opin-
ions was very important. With regard to this, Kim and Hovy developed a system called Crystal by which
they explored the use of generalized lexical features within posts on a Canadian electoral forum [36].
They used supervised learning on these features to predict which party would become victorious within
a certain riding2 based on the forum predictions [36]. This was done through a SVM (Support Vector
Machine) approach using varied feature combinations. They found that using a combination of uni, bi
and tri-grams, they could successfully decipher the predicted party within the message with an accuracy
of 73%. Furthermore, they could predict the result of a riding at an accuracy of over 80%.
The Crystal system has been influential in this area of study as similar techniques have been used
but in different domains [39, 17]. In [17], Baker took the Crystal idea and created a system whereby
forum posts were taken from a UK election site and used to predict the outcome of an upcoming elec-
tion (CrystalUK). He used a combination of uni, bi and tri-grams as his feature set and using a SVM

2 Canadian equivalent to constituency

approach, achieved a higher constituency prediction accuracy when compared to the original. However,
the system’s message prediction accuracy fell 4% short of the standard set by [36] with an accuracy of
69%. The system was then extended (Crystal2.0) to use pronoun resolution3 as a feature within classi-
fication. Crystal2.0 also took into account the SVM classification strength of the sentences, rather than
just whether they were positive or negative. Although Baker claimed this created a more robust system
capable of generalising to smaller data sets, the system did not improve on CrystalUK [17].
More relevant to this current project, Crystal has also been used within an American Football context
[39]. McKinlay used the ideas produced by [36] to analyse fans’ predictions within forum posts [39]. As
with [36] and [17], an n-gram feature combination was implemented to classify the data. As well as the
SVM approach used above, McKinlay also looked into using a rule-based classifier called RIPPER4.
This system reached a message prediction accuracy similar to that of [17] and [36] but only reached a
52% accuracy when predicting the outcome of a match. He suggested that this poor accuracy (marginally
above a random naive approach) was dependent on the accuracy of the fans’ predictions as opposed to
the quality of the system. This theory was the original project’s basis of analysing professional expert
opinions and whether they were any better than the fans’ forecasts.

3.2.2 Judgment Opinions

Film reviews appear to be a good source from which judgement analysis can be formed [45, 40]. In [40],
Pang, Lee and Vaithyanathan found that when using a variety of standard machine learning techniques,
sentiment classification is much harder to achieve than simply classifying the topic of the text. They
suggested this is down to the reviewing author sometimes employing a ‘thwarted expectations’ narrative
in which he/she will use positively orientated words, only to come to the conclusion that the film is poor
and vice versa [40]. One example of this is “The film should be great, the cast is good, the director is
experienced but it turns out to be a film to miss”. These conclusions would have been important during
the analysis of expert opinions as they could have contained such ‘thwarted expectations’.
This is supported within [45] where it is claimed that “the whole is not equal to the sum of the
parts”. Here, Turney uses a simple unsupervised learning algorithm to classify reviews for films, banks,
cars and travel destinations as either recommended or not recommended. Sentiment classification here
3 Where pronouns are replaced with the previously mentioned candidate or party, thus increasing the frequency of that
candidate/party within the post
4 Where both Ripper and SVM were found to produce similar results

performed poorly as some film reviews include words with negative connotation such as ‘blood’ and
‘evil’. Although these words do not invoke a positive reaction, they could be used to give a good review
of a horror film thus leading to misclassification. However, he found that banks and cars were generally
easier to classify and concluded that the whole is the sum of the parts in these contexts [45]5 .
Sentiment mining has been carried out under varying detail: document, sentence and word-level.
Revisiting [45], Turney used a number of techniques to classify a film review as recommended (or not).
He used a Part-Of-Speech (POS) tagger to extract phrases containing adjectives or adverbs within the
film reviews. He then used an Information Retrieval tool to see the similarity of these phrases with the
words ‘excellent’ and ‘poor’. This managed to achieve an overall review classification of 75% [45].
This was improved within [40] using differing methods, where an accuracy of 83% was achieved.
They used Naive Bayes, Maximum Entropy and an SVM (Support Vector Machine) classifier on a
standard bag-of-words framework on each film review. Then using a variety of features including com-
binations of unigrams, adjectives and a POS tag attached to each word, each of the three classifiers was
applied. These all reached accuracies between 73% and 83% but it was the SVM when used with
just unigrams that achieved the highest classification rate [40]. One interesting feature that was used
within this study was looking at the position of the word within the document. This worked on the basis
that the summary of a review is usually found at the end of the document and although this feature did
not improve the accuracy a great deal, this is worth considering within a textual algorithm.
As well as looking at fact/opinion classification, Yu and Hatzivassiloglou investigated sentiment
classification at sentence-level [49]. The authors tried to distinguish between positive, negative and
neutral sentences within Wall Street Journal articles. They used the hypothesis that positive words
co-occur more often than just by chance (and the same for negative words) [49]. Using this, they
applied a seed set of semantically-orientated words to calculate a modified log-likelihood ratio for each
word within a sentence. This ratio represented the word’s sentiment, thus an average of the ratio was
used to classify the sentence’s sentiment (i.e. the amount of positive and negative words in a sentence
determining its polarity). They found that using a combination of adjectives, adverbs and verbs from
the seed set yielded the best results [49].
More detailed studies have classified the sentiment of individual words or phrases, more specifically
the sentiment of words found in subjective expressions [47]. Wilson, Wiebe and Hoffmann define a
subjective expression as any word or phrase used to express an opinion, evaluation, stance, etc [47].

5 Turney found that travel destination reviews were somewhere in between the two extremes

The authors used a lexicon of over 8,000 single-word subjective clues to find the polarity of expressions
within a corpus. This was helped by distinguishing the difference between a clue’s prior and contextual
polarity, where prior indicates the polarity of the word on its own and contextual polarity indicates the
sentiment of the word within a specific phrase. Here is how a prior polarity and contextual polarity
could differ within expert opinions:
Prior polarity - ridiculous [negative].
Contextual polarity - The Dolphins’ wide receivers have a ridiculous amount of pace [positive].
Using a simple classifier where the prior polarity was used to predict the contextual polarity of a clue,
they reached an accuracy of only 48%, noting that a lot of words with non-neutral prior polarity appeared
in phrases with neutral contextual polarity. Thus they devised two classifiers, one to firstly classify
a clue’s context polarity as neutral or polar, then a second to decide the clue’s actual polarity (either
positive, negative or both).
The first classifier was done using 28 different features to separate all the neutral and polar words.
Subsequently, clues that were classed as polar were then used within the smaller second classifier (10-
feature) to decipher polarity. The 28-feature classifier managed to distinguish between polar and neutral
words/phrases with 75% accuracy. However, the second classifier achieved an accuracy of only 65%,
concluding that the task of classifying between positive and negative words is more difficult
than classifying between polar and neutral [47].

3.3 Numerical Analysis


3.3.1 Numerical Models for Predicting Sporting Results

Various numerical research has been carried out in a sporting context to see how statistics can be used
to formulate a prediction to an outcome. Other than solely looking at research regarding American
Football, I decided to investigate similar studies using English football (soccer). This was due to the
popularity of the sport and thus the popularity of studies using that context. This can be justified by
the fact that there are a number of papers regarding American Football which cite research into
modelling soccer matches [18, 43] and vice versa [27, 25].
After some initial research, I noticed that the motive of many authors to create such models centred
on trying to beat the gambling market. The authors cited different reasons for this approach. Many
saw this as a good way of assessing their proposed predictive model as the gambling market is generally
considered the most accurate source of prediction [18, 31, 25]. Forrest, Goddard and Simmons conclude

that this is due to the financial incentive associated with a bookmaker’s prediction when compared to
that of statistical systems or experts [25]. The most common reason amongst authors for comparing
their proposed systems to gambling data was to find whether inefficiencies occur in the betting market
[46, 29, 28, 27, 22]. A market is efficient if no betting strategy exists whereby on average that strategy
yields significantly positive returns [29]. Alternatively, some authors take the view of creating a model
simply for the purpose of betting (and presumably to make money!) [41].
Some studies were carried out to see how accurate ‘expert’ predictions regarding sporting events
are and how these compared to the forecasts of the numerical models [43, 18]. Originally I researched
these with a view to comparing them with my textual experts’ findings. However, this research was
useful to assess the actual level of expertise and knowledge that are involved in these predictions and to
see if just by using historical data, these predictions could be bettered through statistical modelling.

3.3.1.1 Models Within American Football

A lot of research I came across involved models that did not simply attempt to predict the winner of a
match but tried to ‘beat’ the spread (i.e. successfully predict the margin by which a team wins). Stefani
highlighted the difference in predicting the winner of a game and predicting the margin of victory.
He claimed that good forecasters for the winners of NFL matches were around the 70 percent mark
whereas to achieve a good profit by trying to ‘beat’ the spread, an accuracy of well over 54% is needed.
He concluded that if both are compared to a random approach (50%) and not many systems are superior
to this spread boundary then the latter task is clearly harder [44].
One model attempted to beat the spread using an Ordinary Least Squares (OLS) approach to high-
light certain biases made by the bookmakers between 1973-1987 [28]. They used three independent
variables, ‘home’, ‘favourites’ and the ‘spread’ within the model. Each NFL match was then modelled
from the focus of one of the competing teams (the team was chosen at random). For example, if a team
were playing away and were favourites then the ‘home’ value would be 0, ‘favourite’ would be 1 and
the ‘spread’ would be the amount by which the bookmaker thought the team would win. The dependent
variable for each match was the difference between the actual margin of victory/defeat and the predicted
spread (i.e. positive/negative if the team beat/lost to the spread, zero if a ‘push’ was found).
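An illustrative rendering of this specification, written in LaTeX, is given below; the symbols are mine and the exact functional form used in [28] may differ:

    % Illustrative reconstruction of the OLS specification described above (not taken directly from [28])
    \[
      M_i - S_i \;=\; \beta_0 + \beta_1\,\mathrm{HOME}_i + \beta_2\,\mathrm{FAV}_i + \beta_3\,S_i + \varepsilon_i
    \]

where M_i is the focus team's actual margin of victory (negative for a defeat), S_i is the bookmaker's spread for match i, and HOME_i and FAV_i are 0/1 indicators for the randomly chosen focus team playing at home and being the favourite.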
Using this model, Golec and Tamarkin found that biases against underdog teams and more specifi-
cally underdog teams playing at home were present within the betting market. They used these theories
to predict the games between 1973-1987 and found that when betting on home teams that were under-
dogs, a winning percentage of 55.6% could be achieved [28]. This is above the profit boundary of 52.4%

indicating that this could be a useful strategy for predicting matches, thus showing inefficiencies within
the market. However, these tests were carried out on the same data which formulated the theory thus
further tests on matches outside of the dataset would be needed (in my opinion) to confirm this.
Another piece of research was carried out in the same vein as Golec and Tamarkin’s work but using
a differing technique. In [29], Gray and Gray viewed the OLS approach used within [28] as flawed.
They deduced that the OLS system gave more weighting to games where the victorious team beat the
spread by a large amount. This approach is not desirable when trying to beat spread betting, as no matter
how well the team beat the spread, the bet is still regarded as a win.
Subsequently, they preferred to use a discrete-choice probit model where the dependent variable
represented whether the team beat the spread or not (rather than how much they beat/lost to the spread).
Each match was then modelled from the home team’s perspective using multiple variables. One variable
represented the winning percentage (in terms of beating the spread) for the two teams in the current
season. Also, a variable representing how many times the teams have beaten the spread in the last four
games was used along with one indicating whether the home team was the favourite or not.
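An illustrative probit specification consistent with this description is sketched below in LaTeX; the variable names are mine and the exact set of regressors used in [29] may differ:

    % Illustrative probit specification (my notation, not taken directly from [29])
    \[
      \Pr(\text{home team beats the spread}_i) \;=\;
        \Phi\bigl(\beta_0 + \beta_1\,\mathrm{HWINPCT}_i + \beta_2\,\mathrm{AWINPCT}_i
                  + \beta_3\,\mathrm{HLAST4}_i + \beta_4\,\mathrm{ALAST4}_i + \beta_5\,\mathrm{HFAV}_i\bigr)
    \]

where Φ is the standard normal cumulative distribution function, the WINPCT terms are each team's season-to-date percentage of games in which they beat the spread, the LAST4 terms count how often each team beat the spread in its last four games, and HFAV indicates whether the home team is the favourite.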
When processing this model on their dataset (matches between 1976-1994), the ‘favourite’ variable
was given a negative coefficient, meaning that being the favourite correlated negatively with a team
beating the spread [29]. This coefficient was found to be significantly different from zero, therefore reaffirming
the findings within [28] that home-underdogs are reasonably likely to beat the spread.
They also concluded that teams that have not performed well in recent games (relative to the spread)
are more likely to beat the spread than if they had recently performed well surmising that the bookmakers
overreact to the recent form of a team [29]. Using the matches within the dataset, the probit model
achieved an accuracy of 54.46% when predicting whether a team would beat a given spread and 56.01%
when tested on ‘held back’ data (both of these accuracies beating a home-underdog approach).
The model was improved by taking into account the probit probability of each modelled match
and only betting where the probability of the team beating the spread was over 0.575. This reached a
significant success rate of 56.42% using in-sample matches and an even greater accuracy using the out-
of-sample data (however this was found to be statistically insignificant). Therefore, the method used
within [29], which ignored the magnitude of the spread and included recent form of each team produced
more accurate predictions than those based on Golec and Tamarkin’s theory (home-underdog).
Bouiler and Stekler also favoured a probit model approach within their study of simply predicting
the winner of NFL games [18]. Rather than using recent form or progress over the course of the season,

they investigated whether power rankings could be used in order to predict the outcome. Initial tests
were carried out over matches played between 1994 and 2000 to see if choosing the higher ranked team
would provide an accurate strategy. This resulted in predicting the correct result 60.8% of the time. The
accuracy was then compared with the predictions of a New York Times sport editor (59.7%), predicting
the home team (61.1%) and the predictions of the betting market (65.8%) [18].
As this probit model did not even beat simply choosing the home team, another probit model was
used to assess the probability of a team winning depending on the actual power ranking difference. It
was found that as the magnitude of the differing rankings increased, the probability of the higher ranked
team winning also increased. For example, if a team was ‘power ranked’ 3rd they would have a greater
chance of beating a team ranked 14th than a team ranked 5th. Thus, predicting the winner in all games
with a probit probability of 0.5 or higher achieved a forecast accuracy second only to the betting market
(beating home prediction and that of the editor) [18]. These findings concluded that the betting market
was the best predictor out of the approaches covered, although power scores do hold information that
can be used to achieve an acceptable prediction success rate.
Another instance of research used a form of ratings to assess the accuracy of predicting the margin
of victory as well as the match winner [44]. Here, Stefani used an OLS approach (similar to that of
Golec and Tamarkin) in which the margin of the victory was predicted for each match. He based these
predictions upon his own ratings for the teams involved.
The ratings were based on the margin of victory from previous games which also took into account
the strength of the opponents (i.e. the rating of the opponent at that point in time). These ratings were
used in conjunction with a constant representing the home advantage. This constant was calculated by
subtracting the average number of away points from the average number of home points within the
dataset6 . As the model relies on the concurrent updating of ratings, it was tested week-by-week during
the 1970-1979 NFL seasons to see how many winning teams could be successfully predicted. These
tests were then compared to the accuracy of the betting line for those games.
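A plausible rendering of the prediction rule described here, in LaTeX with my own notation, is:

    % Illustrative rendering of the rating-based prediction rule (notation mine)
    \[
      \widehat{m}_{H,A} \;=\; R_H - R_A + h
    \]

where the left-hand side is the predicted margin of victory for the home team, R_H and R_A are the current ratings of the home and away teams, and h is the home-advantage constant, estimated as the average home score minus the average away score across the dataset (around 2 points for the NFL, as noted in the footnote).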
When predicting the winner, the least squares (OLS) model achieved an accuracy of 68.4% with
the ‘home’ constant, which dropped by 2 percent when this constant was omitted from the model
(highlighting the NFL home advantage). However, even with the home advantage considered this still
fell short of the betting line which reached a success rate of 71% [44]. This further supports the claim
that betting lines are generally superior in predicting matches compared to statistical models.

6 This home constant was found to be around 2 points for the NFL

Stefani then proceeded to see how the model fared when trying to predict the actual margin of
victory. A successful prediction was seen to be if the predicted outcome was in the same ‘direction’ as
the actual outcome and above the margin of victory. The results were categorised in terms of how many
points the model’s predicted outcome was away from the betting spread for that game (1-2, 3-4, etc).
This saw that if the model was only 1 or 2 points away from the betting line then it could accurately
predict margins of victory at 58.4% [44].
In [31], Harville used a mix of complex linear-models to predict the outcome of American Football
games between 1971 and 1977. The system relied on the differences in the teams’ yearly characteristics.
These characteristics were the number of points scored/conceded by that team in relation to an evaluated
‘average’ team within that year. These are similar to the ratings used within [44] as they represent the
difference in strength of the two teams. The proposed model also takes into account the home advantage
in a similar vein to the one within [44]. The optimal values for the model’s parameters were then
found through a maximum likelihood procedure. The model achieved an accuracy of 70.3% which only
slightly fell short of the predictions made by the betting line during that time period [31].
He acknowledges a flaw within the model described as it does not restrict instances of
teams scoring large amounts of points against weaker teams (e.g. 54-0) known as ‘running up’ the
score [30]. These outcomes in his model will receive higher weighting and he claims that this has an
adverse effect on the modelling process and should be somehow restricted [31]. Harville states that
the key to successfully predicting sporting events is related to the rating or ranking of teams (e.g. his
‘characteristics’ or power scores in [18]).

3.3.1.2 Models Within Other Sports

One statistical model assessed the efficiency of the soccer betting market by predicting the outcome of
English football matches [27]. Here, Goddard and Asimakopoulos used an ordered probit regression
model to assess whether weak-form inefficiencies occur within the betting market. That is, whether all
historical data (previous results, etc) relevant to a specific match is contained within the odds, otherwise
the market is found to be weak-form inefficient. As well as implementing soccer equivalents of some of
the features seen in Section 3.3.1.1 (e.g. recent performances or season win ratio), the probit model also
incorporated a number of novel features. Some of these included the distance between the two teams
and whether the match is significant in terms of promotion or relegation for either team. The model
was estimated using 15 years worth of previous data and tested on English football games between
1998-2002. These games would then be examined against odds from 5 separate bookmakers.

To compare this system to the bookmaker’s odds, a separate model was needed to represent these
odds. This was done through a simple linear model by regressing the result of a match (home, away or
draw) against the implicit bookmaker’s probabilities attained from the odds for that match. They then
added the probability of the result from the probit model to the bookmaker’s model in the form of a vari-
able. The authors surmised that if information held within the probit model was already present in the
bookmaker’s odds then this variable should be insignificant when the linear model was re-estimated. In
other words, the information used within the probit model predictions should not help the bookmaker’s
model in reaching the prediction of a match. However, it was found that the variable was significant at
the 0.01 level, concluding that the model they proposed does contain information not captured within the
bookmaker’s odds. The authors determined that this showed the soccer betting market was weak-form
inefficient. This was backed up by claims they could get a non-negative return on betting when using
the model’s highest result probability for each match and taking the best odds for that result [27].
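An illustrative form of this efficiency test, in LaTeX with my own notation (the actual specification in [27] is more detailed), is:

    % Illustrative rendering of the weak-form efficiency test (my notation)
    \[
      \mathrm{result}_i \;=\; \alpha + \beta\, p^{\mathrm{book}}_i + \gamma\, p^{\mathrm{probit}}_i + \varepsilon_i
    \]

where p^book_i is the implicit bookmaker probability for the observed result of match i and p^probit_i is the ordered probit model's probability of that same result; weak-form efficiency corresponds to γ = 0, and the significance of γ at the 0.01 level is what led the authors to reject it.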
One interesting use of team strengths was undertaken by Rue and Salvesen in which they catered
for the psychological effect in relation to the difference between the qualities of the competing teams.
Their work was carried out using a Bayesian model for home and away teams whereby the goals scored
by one team relied on the difference between their attacking strengths and the opposition’s defensive
strengths. The psychological variable was attained by calculating the difference between the two team’s
collective strengths (i.e. home attack + home defense - away attack - away defense). This is based on
the assumption that if a team is far superior to the other team, then complacency could set in for the
better quality team thus giving the perceived weaker team an advantage [41].
Since the magnitudes of victory within previous matches were taken into account, the authors imposed
a cap of 5 on the number of goals recorded from previous games. For example, if a previous
game ended 7-0 then this would be recorded as 5-0, as they deduce that goals beyond this mark are not
informative within the model development. This echoes the claim made by Harville that teams should
not be allowed to 'run up' the score [31]. When tested on the second half of the 1997-1998 season (using
the first half of that season to collect information about the teams' respective strengths) it reached a
similar prediction performance to the bookmakers' odds available for those games [41].
Away from soccer, Hu and Zidek looked into forecasting basketball games [32]. More specifically
they used a Weighted Likelihood approach in modelling NBA Playoff games to predict a winner. Firstly,
they define two types of historical data which can be used to predict the outcome of a game. One of
these is 'direct' information, which refers to data relating only to matches between the two teams

(i.e. the most recent results between the teams) [32]. The other type is called ‘relevant’ information,
which is simply attributed to all other data that can be used to predict the outcome [32].
In this study, the authors used all the games played within a season to predict the outcome of the end-
of-season Playoff games7 . Here, the ‘direct’ data refers to the results of games involving the two teams
within that season. All the other games played by each team within the year were used as ‘relevant’
information. Therefore, if Chicago Bulls play at home to Orlando Magic in a Playoff match, the model’s
‘direct’ data refers to the results between the two teams during that season where the Bulls were at home.
The ‘relevant’ results are where Chicago Bulls played at home in that season against other teams along
with the results of Orlando Magic’s games when they played away. Only the results of these games are
used and not the magnitude of victory/defeat. It is also noted that this approximation does not take into
account the strength of the opposition within these previous results. They do however try to combat
this in modifying the model by removing the weaker teams from the ‘relevant’ results (only recognising
teams that had won over 50 out of the 80+ games) which provided better results than the original version.
This model, whereby the weaker teams were excluded proved successful in predicting the winner of the
1997/1998 Chicago Bulls and Utah Jazz Playoff match [32].
Lastly, Stefani showed that models can be transferred across different sports. He did this by applying
his least squares approach (mentioned previously within Section 3.3.1.1) to basketball and soccer [44].
Initially, he used the model to predict college basketball games between 1972 and 1974, with an accuracy
just short of 70%. This was bettered when he adapted the system to the 1974 World Cup, where an
accuracy of 74% was attained in predicting the results throughout the tournament [44]. This suggests that
consistent forecast rates can be achieved when a model is used within a new sporting domain.

3.3.1.3 Expert Opinions Within Sports

Song, Boulier and Stekler looked into the NFL predictions of statistical models and experts on a mass
scale [43]. They collected the predictions of 70 experts (from the national media) and 32 numerical
models (from various football research) to get forecasts for American Football games between 2000
and 2001. They found that on average, the expert and system accuracies were relatively similar when
attempting to predict the winner, achieving 62.2% and 61.7% respectively8 . However, yet again these

7 Each basketball team plays around 80 games in a season, so a season encompasses more information than a soccer or
NFL season
8 The difference was not found to be statistically significant

forecasts were overshadowed by the accuracy of the bookmaker's predictions. It was also noted that
the dispersion of accuracies was much higher amongst the experts than the statistical systems, with experts
achieving both the best and worst accuracies. This shows us that forecasting success is much more consistent
when dealing with numerical systems than with experts [43]. This is backed up by research from Boulier and Stekler
(mentioned in Section 3.3.1.1) which showed that the New York Times Editor's predictions were worse
than simply choosing the home team within NFL games [18].
Next, they used the forecasters mentioned above to predict each game's margin of victory
against the betting spread. It was found that both sets of approaches fell (on average) short of the
52.4% accuracy required for a profit. In summary, the authors concluded that there was not enough statistically
significant evidence to separate the accuracies of experts and numerical systems [43].

3.3.2 Regression Analysis

If the solution to the project's problem was to use a model similar to those mentioned above, then due to
the number of features and the vast amount of data plugged into the system, regression analysis was
required to evaluate the project’s prototypes. Chatterjee and Hadi define regression analysis as “a set of
data analytic techniques that examine the interrelationships among a set of given variables” [20]. These
techniques could be utilised to inspect the features during model training and using the interrelationships
between the features to predict the outcome of unseen matches.
The basis of most statistical modelling involves transforming each observation (sample of data) into
an equation whereby a result value (the dependent variable) is equal to a weighted sum of the model's
features (independent variables). Each feature has a weight attached to it, where the weight represents
the correlation between that feature and the model's result value. This equation can be seen here [34]:

y = ∑_{x=1}^{n} w_x × f_x = w · f    (1)

where y is our dependent variable, f represents the vector of the model's n features, w the vector of
associated weights and x indexes the features. Thus, using the known variable values (both dependent and
independent) of the training observations, the coefficients of the weights are found through an estimation
method. Then, during the testing of the model, these weights are used in conjunction with the independent
variable values of each test observation to predict a dependent variable value [34]. With respect to
the problem in hand, clearly each observation is an NFL match where the dependent variable represents
the outcome of that game. Each independent variable represents an item of data relative to the match

that could be used to predict the outcome.
One way in which this process can be carried out is linear regression, where the value of the dependent
variable is a real number. With respect to this project, linear regression could have been used to
predict the magnitude of victory for NFL games, which would yield an 'implicit' predicted winner
for each game (i.e. a positive value would be a home win, a negative value an away win). However, this would involve
extra processing of the model that could be avoided if an alternative regression method which produces a
binary result value was used instead. This is where logistic regression was considered.

3.3.2.1 Logistic Regression

Logistic regression is a mathematical modelling approach that can describe the correlation between several
independent variables and a binary dependent variable [37]. As the solution to the problem simply
needs to predict whether team A or team B wins, logistic regression seems better suited to analysing the
prototypes within this project. The technique of logistic regression is based around the concept of odds.
In other words, within the testing of our NFL games, the model will assess the game data
and attain two probabilities: one representing the likelihood of the outcome being a home win and the
other an away win. The former is then divided by the latter in order to obtain
the odds ratio [37]. For example, if a game has a 0.25 probability that the home team will win and 0.75
that the away team will win, then the odds ratio would be 0.25/0.75 (a third). In other words, the probability
of the home team winning is one-third of the probability of the away team winning (or, in bookmakers'
terms, 3-1 for the home team to win). Thus we have the equation:
ln( P(y=1|x) / (1 − P(y=1|x)) ) = w · f    (2)

The left-hand side of (2) is the logit function and represents the natural logarithm of our odds ratio [34]. An equation is now
needed to obtain the probability of y being true; using algebraic manipulation on equation (2) we form9:
P(y = 1|x) = 1 / (1 + e^(−w · f))    (3)

Equation (3) is called the logistic function and maps values from −∞ to ∞ into the range 0 to 1
(which will be utilised to attain the odds' probabilities).
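
To make this concrete, the following is a minimal Java sketch (my own illustration rather than the project code; the weights and feature values are hypothetical) showing how a weighted feature sum w · f is mapped to a home-win probability by the logistic function of equation (3), and how the corresponding odds used in equation (2) are obtained:

// Minimal sketch of equations (2) and (3): dot product, logistic function and odds.
public class LogisticExample {
    static double dot(double[] w, double[] f) {
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * f[i];
        }
        return sum;
    }

    // Equation (3): P(y = 1 | x) = 1 / (1 + e^(-w.f))
    static double probHomeWin(double[] w, double[] f) {
        return 1.0 / (1.0 + Math.exp(-dot(w, f)));
    }

    public static void main(String[] args) {
        double[] w = { 0.8, -0.3 };   // hypothetical weights
        double[] f = { 1.0, 2.0 };    // hypothetical feature values for one match
        double p = probHomeWin(w, f);
        double odds = p / (1.0 - p);  // the odds quantity inside the logit of equation (2)
        System.out.println("P(home win) = " + p + ", odds = " + odds);
    }
}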

3.3.2.2 Maximum Likelihood Estimation

Clearly, some way was needed to obtain the coefficients of the weights within the logistic model.
One of the more common approaches is known as Maximum Likelihood (ML) estimation [37].

9 This will not be detailed here, for more information see Jurafsky and Martin [34]

This is the process of training the weights to achieve the highest probability of each observed y value
[34]. Kleinbaum and Klein define "the likelihood function as the likelihood of observing the data that
have been collected" [37]. That is, the probability produced by using certain coefficients for the
respective weights w within the model [34]:

L(w) = ∑_i log P(y^(i) | x^(i))    (4)
where i ranges over all of the matches within the training data. Therefore, during the training of the
model, we need to find the optimal weights ŵ that produce the highest probability for the outcomes
within the training data [34]:

ŵ = argmax_w ∑_i log P(y^(i) | x^(i))    (5)
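
As an illustration of how equations (4) and (5) are used, the Java method sketch below (my own simplified code, not the project's) computes the log-likelihood of a set of weights over some training observations; an estimation routine would then search for the weights ŵ that maximise this value:

// Sketch of equation (4): the log-likelihood of weights w over the training matches.
static double logLikelihood(double[] w, double[][] features, int[] outcomes) {
    double sum = 0.0;
    for (int i = 0; i < features.length; i++) {
        double z = 0.0;
        for (int j = 0; j < w.length; j++) {
            z += w[j] * features[i][j];              // w . f for match i
        }
        double p = 1.0 / (1.0 + Math.exp(-z));       // P(y=1|x), equation (3)
        sum += Math.log(outcomes[i] == 1 ? p : 1.0 - p);  // log P(y_i | x_i)
    }
    return sum;
}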

3.4 Machine Learning Software


If one of the regression techniques was to be used in this project then software would be needed to carry
it out. This software should be able to process vectors of features, where each vector represents
one NFL match. It must be able to make use of the feature data encompassed within each
training vector and form feature weights to produce predictions for the unseen test vectors.

3.4.1 WEKA

WEKA (Waikato Environment for Knowledge Analysis) is a collection of machine learning algorithms
for use in a data-mining context [48]. It is open source under the GNU General Public License and
also incorporates an API which allows WEKA functionality to be used within a Java program. Given
this, the training and testing of prototypes could be incorporated into the Java scripts which create said
prototypes.
The WEKA algorithms include a Logistic regression class which implements a ridge estimator in
conjunction with the ML estimation described above. This ridge implementation restricts the weights
within the ML process, which aids logistic models where there are considerably more independent
variables than data observations [19]. As I used NFL match data encompassing more than 20 years, this
scenario did not apply in this project, thus the ridge estimator was not needed.
WEKA's input takes the form of ARFF (Attribute-Relation File Format) files, which are ASCII
text files describing a list of instances relating to a set of attributes [48]. Aside from the
advantage of having Java capabilities, this software was also used within McKinlay's successful body
of work [39], thus I decided to use it to analyse the prototypes that were created.

3.5 Summary of Reading
The focus of this project was to be a logistic regression model whereby each match was represented
through multiple variables and estimated using Maximum Likelihood. I did not come across any re-
search involving a logistic model to predict the outcome of sporting events so I decided to use this
approach and see if it could reach the predictive accuracies achieved by the alternatives mentioned (e.g.
linear, OLS, etc).
The independent variables would be based on historical data or other novel features and be used to
predict the binary dependent variable representative of the match outcome. The need for the model to
produce this binary value was the main reason to use a logistic approach.
The logistic model can be seen to be just as effective as other, more complex equivalents. Duh et al.
surmised that the logistic function was no worse in performance than a Neural Network
approach and also claimed that the logistic function is simpler and computationally less expensive [23].
This simplicity would aid the project with regard to understanding and analysing the nature of each
feature. Moreover, eliminating the extra computational time associated with using a more complex
model would give me more time to work on prototypes.
In terms of the data that was to be used within the model, I did not include friendly or Playoff
games. Harville concluded that, as they are not competitive in the literal sense, friendly matches are
hard to predict and have little predictive quality anyway [31]. Also, Vergin and Sosik found that Playoff
games are very unpredictable and veer away from regular-season conventions [46], thus these will also not
be considered here. Furthermore, as mentioned previously, tied games would not be used within this
project (in both training and testing).
This statistical model will make use of the most sensible and most useful features with the view
to attaining the highest prediction accuracy possible. These features will take inspiration from the
models researched within Section 3.3 and from the evaluation of previous prototypes. I also intended
to implement a couple of baseline prototypes which relied on a single variable. Having said this, it was
still my aim within these early prototypes to achieve an accuracy above the 50% expected from a random
prediction approach. The objective then was to assess these basic ideas and, if they showed some predictive
ability, to incorporate them into the more complex logistic model.

Chapter 4

Prototypes

4.1 Data Collection


To carry out this numerical analysis, data was needed in the form of previous American Football games
dating back at least 10 years. Obviously, these results had to be accurate and ideally collected from the
same source to ensure consistency. Research into this led to the discovery of one such website1 . An
example of the website data can be seen in Figure 4.1.
Although the records stored here actually went back as far as 1920, the decision was made to only
use results from 1970 onwards. This was based on reasoning that in terms of storage and algorithm run
time, going back further than this date would not be beneficial as 37 years worth of data would suffice
in training and testing. Furthermore, before 1970, a league named the AFL (American Football League)
existed which rivalled the NFL. This meant that professional teams were split across the two leagues up
until the AFL and the NFL merged in 1970.
As each NFL season was situated on a different web page, the next step was to create a program
in Java that would ‘spider’ through the website. Built specifically for this website’s HTML, it iterated
through the different pages and printed all the matches to a text file (footballResults.txt). My specially
created application recorded the date of the match, the home team, the away team and their respective
scores2 . Also printed to the file was the outcome of the game (home team, away team or tie). The
section of footballResults.txt that relates to the input within Figure 4.1 can be seen in Figure 4.2.
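
The HTML parsing itself was specific to the site and is not reproduced here, but the sketch below (an approximation with hypothetical method and field names, not the actual spider code) shows the kind of record-writing step performed once a match had been parsed, appending one line per game to footballResults.txt with the fields listed in Figure 4.2 (shown comma-separated here for illustration):

// Approximate sketch of the record-writing step of the data collection spider.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class ResultWriter {
    public static void appendMatch(int matchId, String date, String homeTeam, String awayTeam,
                                   int homeScore, int awayScore) throws IOException {
        String winner;
        if (homeScore > awayScore) {
            winner = homeTeam;
        } else if (awayScore > homeScore) {
            winner = awayTeam;
        } else {
            winner = "Tie";
        }
        // One line per game: id, date, teams, scores and the outcome.
        PrintWriter out = new PrintWriter(new FileWriter("footballResults.txt", true));
        out.println(matchId + "," + date + "," + homeTeam + "," + awayTeam + ","
                + homeScore + "," + awayScore + "," + winner);
        out.close();
    }
}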
As you can see from Table 4.1, the number of draws between 1970 and 2006 was very small,
which backs up the claim made in Section 3.1 that the NFL has very few tied matches. To highlight the
contrast with soccer, this home:away:draw ratio of approximately 46:34:0.5 can be compared to
a ratio of 46:27:27 for soccer matches [22].

4.2 Prototype 1 (HOME)


The first prototype was a very simple predictor which, as the name suggests, crudely selects the
home team to win every time. This is based upon a widely-held view that within sports (especially team sports)
1 http://www.pro-football-reference.com/

2 As NFL teams are franchised, some have undergone various name changes over the years, thus all names were converted
to their most recent franchise name

Figure 4.1: Pro-Football-Reference.com Data for Start of 1975 NFL Season

Table 4.1: Statistics from the collected match data between 1970-2006

Matches 8063
Home Wins 4631
Away Wins 3387
Draws 45

the home team has some advantage over the opposition [46, 18, 31, 44]. Clearly, this advantage
is not easily measurable, and if the away team is much stronger than the home team then this advantage
will only go so far in helping the home side. However, due to various factors such as the influence of
the home crowd outnumbering the away fans, the travelling involved for the away team and so on, it is
a sensible place to start in determining the winner of a football game. What is more, research has
shown that this naive method can outperform both statistical models and expert opinions [18].
The algorithm (HOME Predictor) which I created simply picks the home team to win in every
match. This simple unsupervised approach performs fairly well and improves on a naive random
approach by over 7% (as shown within Table 4.2). When compared with a random approach3, the
difference was found to be significant, comfortably passing the 0.05 threshold I had chosen from the outset.

3 Prediction relied on choosing the team whose name came first alphabetically, incidentally attaining 48.1% accuracy
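
A minimal sketch of this baseline (my own illustration rather than the project code) is shown below; it simply counts how often the home team is the recorded winner:

// Baseline HOME predictor: predict the home team in every match and measure accuracy.
public static double homeBaselineAccuracy(String[] homeTeams, String[] winners) {
    int correct = 0;
    for (int i = 0; i < winners.length; i++) {
        if (homeTeams[i].equals(winners[i])) {   // prediction "home team wins" was correct
            correct++;
        }
    }
    return 100.0 * correct / winners.length;     // accuracy as a percentage
}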

Figure 4.2: Text Output from the Data Collection Program. Displaying Match Id, Date, Home Team,
Away Team, Home Score, Away Score and the Winner

4.2.1 Prototype Summary

Referring back to the prototype approach recommended by Hughes and Cotterell [33]:

• I hoped to learn how accurate choosing the home team is within American Football.
• This was to be evaluated by the accuracy when tested on all games and was to be compared to a
random approach.
• It was shown that predicting the home team produced a significantly better accuracy than choosing
the winning team at random.

4.3 Prototype 2 (PREV RES)


After seeing that the home advantage is a clear one in American Football, I tried to see if previous results
between the two teams could improve on the first prototype. This built upon the theory of ‘direct’ data
proposed by Hu and Zidek [32] mentioned in the background reading. This supervised approach should
improve on the simplistic approach seen in Prototype 1 as I am using historical data to predict the
winner.

4.3.1 Design

This prototype took each game and used the previous encounters between the two teams to predict the
outcome. Firstly, I needed to know how much ‘direct’ data to use. Thus, each match used 1, 3, 5 and

10 years worth of previous meetings to yield four separate predictions and attain the optimal number of
years to use. Therefore, I split the processing up into running one algorithm four times, each using a
differing parameter (the number of years to go back). Each of these iterations would then be compared
against each other and ultimately against Prototype 1.

4.3.2 Implementation

I created a program (PREV RES Predictor) which iterated through the results text file (footballResults.txt),
taking each match and creating a PrevRes class using the two teams and the match year as
arguments. This class contained a method which took the number of years to look back and iterated
through the results to get the previous matches4 between the two teams within that designated period. This
process was called 4 times for each match (using 1, 3, 5 and 10 years as arguments). Then the aggregate
number of wins for each team was found, with the higher of the two aggregations becoming the
predicted winner of the match5.
To increase efficiency of the algorithm, rather than constantly iterating through the results text file,
the results were stored in a result matrix. The result matrix simply involved storing all the match-ups
that had happened since 1970 and recording win tallies between the two teams for each year since then.
The matrix is a Java HashMap containing all the different combinations of teams (teamA vs teamB) that
have played each other since 1970. Then within each match-up in the HashMap is another HashMap
containing tallies (number of teamA wins, number of teamB wins and the number of draws) for each
year since 1970. A fragment of this matrix can be seen within Table 4.3. Prototype 2's algorithm using
the matrix can be seen in Algorithm 1 (Appendix C). Here we see that it only iterates the list of matches
within footballResults.txt once, which is much more efficient than doing so every time a result needs to be
found.
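
A simplified sketch of this structure and of the prediction step is given below (the real implementation is Algorithm 1 in Appendix C; the names here are illustrative, and a full version would also check the reversed match-up key). The outer HashMap is keyed on the match-up and the inner HashMap on the year, holding win tallies of the form (team A wins, team B wins, ties):

// Simplified result matrix and PREV_RES prediction using the previous N years.
import java.util.HashMap;

public class PrevResSketch {
    // "TeamA vs TeamB" -> (year -> {teamAWins, teamBWins, ties})
    static HashMap<String, HashMap<Integer, int[]>> matrix =
            new HashMap<String, HashMap<Integer, int[]>>();

    static String predict(String home, String away, int matchYear, int yearsBack) {
        HashMap<Integer, int[]> tallies = matrix.get(home + " vs " + away);
        int homeWins = 0, awayWins = 0;
        if (tallies != null) {
            for (int year = matchYear - yearsBack; year < matchYear; year++) {
                int[] t = tallies.get(year);
                if (t != null) {
                    homeWins += t[0];
                    awayWins += t[1];      // draws (t[2]) are ignored
                }
            }
        }
        if (homeWins > awayWins) return home;
        if (awayWins > homeWins) return away;
        return home;                       // equal tallies: fall back on Prototype 1 (pick the home team)
    }
}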

4.3.3 Evaluation

Table 4.4 details the breakdown of how each year-set of previous results fared. This tells us that although
previous results are a useful piece of information, the further back the meetings go, the less
accurate the feature becomes. This is shown by 10 years worth of previous results attaining a prediction accuracy of
55%, whereas using only last season's results yields an accuracy of 58%. Although this 58% marginally
beats the first prototype (Table 4.2), I failed to reject the null hypothesis as the difference was not significant.

4 If a previous meeting ended in a draw, the encounter was ignored


5 If the two teams won the same amounts of games then the algorithm would fall back onto Prototype 1 algorithm

This chronological decrease in accuracy occurs because information about a certain team is more
accurate the nearer it is to the present year. Take, for example, a Miami Dolphins home win against the Dallas Cowboys
10 years ago; many factors may have changed since then. It is likely that no players from that match
still play for either team, the Dolphins might play their home games in a different stadium,
and so on. Therefore that game is not an accurate representation of the two teams' current situation and
thus it is not a good piece of data for predicting the outcome of a present-day game.

4.3.4 Prototype Summary

Referring back to the prototype approach recommended by Hughes and Cotterell [33]:

• I hoped to learn how accurate predictions solely reliant on previous results between the two
competing teams would be, whether using more or fewer years of results affected this accuracy, and furthermore
whether this data held more predictive quality than just choosing the home team.
• This was to be evaluated through the accuracy tested on all games when using 1, 3, 5 and 10 years
of previous results. Comparing these 4 iterations and contrasting the most accurate with Prototype
1.
• I learnt that the more data used in this scenario, the less accurate the predictor became. It also
showed that using last year's results roughly equalled predicting the home team to win.

Table 4.2: Prototype 1 & 2 Tested On 8018 Matches between 1970-2006

Prototype Accuracy(%)
1 (HOME) 57.8
2 (PREV RES) 58.0

Table 4.3: Part of Result Matrix Showing (Team A Wins-Team B Wins-Ties) Tallies From Various Years

Team A vs Team B 1970 1971 ..


Atlanta Falcons Miami Dolphins 1-2-0 1-0-1 ..
Dallas Cowboys Pittsburgh Steelers 2-0-0 0-1-0 ..
.. .. .. ..

4.4 Prototype 3 (Goddard & Asimakopoulos Model)


After assessing the results from the previous two prototypes, the accuracies were higher than a random
prediction approach, but the question was asked: can more complex models improve on these accuracies?
Table 4.4: PREV RES Results Using Differing Amounts of Data

No. of Years Accuracy (%)


1 58.0
3 56.6
5 55.8
10 55.0

This can be seen from a least squares approach which achieved 68.4% when predicting the winner
of an NFL match [44]. Harville achieved even higher (70%) when he implemented his mixed linear
model to carry out the same task [31]. It was clear that multiple features would need to be incorporated
into one statistical model to predict the outcome of matches with a high success rate. This
supervised approach (more complex than Prototype 2) would therefore look at previous results,
assessing the relationships between these features and the outcome in order to predict the outcome of other matches.
Taken from research within Chapter 3, the statistical model put forward by Goddard and Asimakopoulos
in [27] was considered. As well as claiming that the features within this
model held information that could be used to successfully predict soccer matches, the authors found that they
could even help the bookmakers make more accurate predictions. Although the model within [27] is
used to represent the outcome of an English soccer match, I felt that its features were
transferable to most team sports.
This was down to soccer and American Football matches sharing a lot of common traits. Both
involve two contesting teams, each consisting of a certain number of players, trying to score the most points
within a set amount of time. The teams in both sports play matches home and away on a frequent
basis within a structured league, where a team's home games are played at a regular venue. Furthermore,
as mentioned previously, the research that I came across within the fields of soccer and the NFL contained
references to the other. Thus, I looked into the features of the Goddard and Asimakopoulos model.

4.4.1 Design

Research within [27] showed how the outcome of an English football game could be represented through
features within an ordered probit regression model. Although a probit approach is not identical to the
logistic model, the ideas and theories behind the independent variables can still be used within this project.
These variables were then adapted to fit the model of American Football. The features that Prototype 3

used to predict an American Football match between home team i and away team j in year k were:

• The win ratios for the 2 years previous to k (for i, j) -
This looked at the percentage of games a team won within a season, carried out for the two seasons
previous to year k, which produces two percentages. This gave some indication of how successful
a team had been against all other teams within previous seasons.
• The mth recent home results (for i, j) -
This used the results of the previous m home games of both teams. [27] found that 9 was the
optimal value of m within soccer. This value was initially used within this model, however at that
stage it was not known whether this value was optimal. The idea of analysing recent games can
also be seen in [29], whereby the last four games were used to assess how teams fared against
the spread, thus emphasising the predictive quality of a team's form.
• The nth recent away results (for i, j) -
This used the results of the previous n away results of both teams (similar to recent home games,
it was found 4 was the optimal value of n [27] and therefore this value was used).
• The geographical distance between cities/towns of i, j -
This was related to how far the away team would have to travel to play the match. This was based
on the hypothesis that the further a team has to travel, the less likely they are to win. When
modelling soccer matches to compare with expert opinions, Forrest and Simmons used 'distance'
as one of their features, highlighting that it could be a factor in predicting the outcome [26].
• The capacity of the stadium in which i and j played their home games during year k -
Within [27], a feature was present which was a residual for a certain team based on average home
attendance and their final league position. This was referred to as the 'big team effect'. Thus if a
team has an above-average home attendance and finished high in the table then it is likely to be a
‘big’ team. It works on the premise that ‘bigger’ teams are more likely to beat ‘smaller’ teams.
As average attendances that dated back more than a year were difficult to find and the fact that
NFL teams are split into small separate tables, this feature was hard to replicate. Hence, I decided
to use the size of stadia that the teams i and j played in within year k. In other words, a team
playing in a large stadium will only have such a stadium if they have enough fans to fill it. Thus
if a team has a lot of fans then it is reasonable to assume they are a successful (or ‘big’) team.
This feature will also give an indication of how the size of the home team's stadium affects the
opposition's performance (building on work by Vergin and Sosik [46]).

• The result of the corresponding fixture last year between i and j -
This looked at whether the teams played each other in the year previous to k with i playing at
home and j playing away. If the teams played two games within this scenario in year k-1 then the
more recent result was used. This also made use of the information found within Prototype 2 by
limiting this feature to just one previous year.

4.4.2 Implementation

A program (VectorModelCreator Goddard) was created which used the results text file to create vectors
representing each match, involving all of the features mentioned in the previous subsection.

4.4.2.1 Feature Extraction

I created a class called Match which, at the point of construction took in information such as the match
year, the two team names, the winner, etc. Thus, VectorModelCreator Goddard took each game and
created a new instance of the Match class using information attained from the text file. This Match class
then allowed access to methods that were used to acquire the features.
The values were written to a text file in the form of a vector, with the year of the match at the start and
each feature separated by a comma. Throughout the process of calculating the various features, if
it was found that a feature did not have a value (for example, if the two teams did not play each other
the year before) then I used WEKA's 'missing value' function. This missing value (stored in
the vector as '?') is replaced with the mean value of that feature within the data set.
The features mentioned within Section 4.4.1 were implemented by the Match class in the form of
different methods (a sketch of the win ratio method can be found after this list):

• Win ratios - This made use of the result matrix again. The method took a year and a team name
as parameters and iterated through all of the match-ups within the matrix to check if one of the
teams within the match-up was the team in question. If so, then the result tally from the specified
year was extracted. Subsequently, the tally values were obtained and used to accumulate all the
wins, losses and draws involving that team within that year. When all the match-ups had been iterated
through, the win ratio was obtained by dividing the number of wins by the total number of games
involving that team within that year. This process was then carried out for all the years previous to
the match year (up to 1970) for team i and j. All the previous win ratios were recorded as this
would allow scope for the number of previous win ratios to be increased. The process of ‘cutting’

all these win ratios to just the previous two years was carried out within the training and test set
creation (Section 4.4.2.2).
• Recent home or away games - This algorithm could not make use of the result matrix as I needed
the x most recent home/away games. Storing the date of the matches within the result matrix would
complicate the storage of matches and contradict the purpose of the matrix itself. Therefore it was
decided to take the results text file and create a reversed version of it. This
involved writing a separate Java program which started from the end of the original results text
file and iterated backwards, writing the same information to a new file (reverseFootballResults.txt).
This enabled an algorithm to iterate through this file until the match in question was found, then
carry on searching the file, extracting home/away results (1 for a win, 0.5 for a draw, 0 for a loss)
involving that team until it reached the limit specified by m and n respectively.
• Distance between teams - I represented the geographic distances in miles between two teams
at any point between 1970-2006. This may not seem too difficult within some sports as when
teams move stadium, they do not move more than 20 or so miles away from their current location.
However, as American Football teams are franchised, it has been known for teams to move over
500 miles to a new location [15]. Thus for each match-up since 1970, data was needed to represent
the distance between the two teams at that point in time.
Firstly, I recorded which cities/towns each team had played in between 1970-2006. This involved
using a web page which listed all the NFL teams (past and present) and where each of the fran-
chises had been based over the years [1]. This was used to get a list of cities/towns which could
have been involved in a match between 1970 and 2006. The distances between these places
were then needed. A distance matrix collated by an NFL stadium website provided many of the
distances between NFL-hosting cities [4]. The rest were obtained from an online tool which
calculated distances between two cities [2]. These two sources enabled me to manually enter the
distances between all cities into a text file.
To cut down on time, if two disparate cities were found to be less than 25 miles apart then
they were 'clustered' into one city (for example, Miami Gardens and Miami are both referred
to as Miami). As some travelling distances reached well over 500 miles, anything less than 25
miles was viewed as nominal. A further note about this process is that some teams have been
known to play their 'home' games at another stadium (sometimes in a different city) for one
match per season. Accounting for this small number of matches would have been more costly
in time than the value extracted from doing so, thus they were ignored.
Once the text file with all the distances had been created, I created a Java class called Stadium-
Calculator. This took three parameters, the two teams playing each other and the year in which it
was played. Then a method within the class would firstly discover the two cities in which the two
teams resided during that year. Subsequently, the method would parse through the distance text
file and find the value associated with the two cities6 .
• Stadium capacities - Again, this feature suffered from the same difficulty with franchises moving
from city to city and from stadium to stadium. Fortunately, the web page which was used within
the distance feature extraction also contained stadia information for each franchise (including ca-
pacities) [1]. To try to ensure that this data was accurate, I checked 5 current capacities against the
respective NFL teams' official websites. As I could not check all the figures, I felt these 5 confirmations
were sufficient. Regarding the issue of teams playing at other stadiums at irregular points
in time, the collection of this data was also carried out manually to ensure correctness.
To try to eliminate any further processing time, I decided to place the capacities into multiple ‘if’
statements rather than have the Java program read through another text file. This involved going
through the web page and, for each team, assessing what stadium capacity they had at a certain
point in time. This information was then represented in a method within the StadiumCalculator
class which, when given a team and a certain year, would search through the 'if' statements until the
capacity was found.
• Last year’s meeting - This took advantage of the reversed results file in extracting not only the
corresponding fixture last year but the most recent (if there was more than one). Therefore within
the Match class, this method searched through the reverse results text file (starting at the year
previous to the match year) and found the first occurrence of the two teams playing each other
(where the home and away teams are the same). If the home team won, 1 was returned, 0 for an
away win and 0.5 for the tie.
• The result - Finally the vector was ended with the outcome of the match. If the home team won
the match then 1 was appended to the vector whereas if the away team were victorious, 0 was
used7 .
6 If the two teams played in the same city then the method would simply return zero
7 As previously mentioned in Section 4.2, ties were not counted within the training or testing
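
A rough sketch of the win ratio method referred to above is shown here (simplified, with illustrative names, and assuming the match-up keys of the result matrix follow the "Team A vs Team B" form described in Section 4.3.2; ties count as games played but not as wins):

// Sketch of the win ratio calculation for one team in one year, using the result matrix.
static double winRatio(String team, int year,
                       java.util.HashMap<String, java.util.HashMap<Integer, int[]>> matrix) {
    int wins = 0, games = 0;
    for (java.util.Map.Entry<String, java.util.HashMap<Integer, int[]>> e : matrix.entrySet()) {
        String[] teams = e.getKey().split(" vs ");
        int[] tally = e.getValue().get(year);    // {teamAWins, teamBWins, ties} for that year
        if (tally == null) continue;
        if (teams[0].equals(team)) {
            wins += tally[0];
            games += tally[0] + tally[1] + tally[2];
        } else if (teams[1].equals(team)) {
            wins += tally[1];
            games += tally[0] + tally[1] + tally[2];
        }
    }
    return games == 0 ? 0.0 : (double) wins / games;
}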

4.4.2.2 Training and Testing Set Creation

Once each match had been represented as a feature vector, the vectors needed to be placed into training
and testing sets. Thus, an optimal number of training years was needed. As the data spanned between
1970 and 2006, I set an upper limit of 20 training years because some years would need to be ‘held
back’ for testing. Therefore, I assessed different training years (every even number between 2 and 20)
where each set was tested on the same years.
Testing started with matches within the year 1999, using the previous 2 years of training data (1997-
1998), then the previous 4 years (1995-1998) and so on up to 20 years (1979-1998). This same process
was then carried out for test years up to 2006. The reason for starting testing at 1999 was that the training
data in the early 1970s contained some incomplete data (e.g. previous win ratios). Furthermore, I had already
thought about introducing Sagarin's ratings within a future prototype (these hold data from the end of
1998 onwards). Having said this, 8 years of testing was plentiful, especially when compared to some
of the small test periods found within my background reading. An example of how 20 years worth of
training data was tested can be seen in Table 4.5.
This set creation involved writing a program which took the text file containing all the vectors
and, using the year at the start of each vector, placed the relevant vectors into the appropriate training and
testing text files. This process also involved trimming all the previous win ratios to just the two
years previous to the year of the training/test file currently being constructed, using the focused
year as an index within the vector and taking the two previous indices.
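
The windowing itself is simple; a sketch of how the (training years, test year) combinations of Table 4.5 can be generated is shown below (illustrative only, not the project's set creation code):

// Generate the sliding training/test windows used when optimising the number of training years.
public static void printWindows() {
    for (int trainYears = 2; trainYears <= 20; trainYears += 2) {
        for (int testYear = 1999; testYear <= 2006; testYear++) {
            int firstTrainYear = testYear - trainYears;
            int lastTrainYear = testYear - 1;
            System.out.println("train " + firstTrainYear + "-" + lastTrainYear
                    + ", test " + testYear);
        }
    }
}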

4.4.2.3 WEKA Vector Converter

To be processed by WEKA's software, each training/test file had to be represented in the
format required for an ARFF file. Therefore, within the Training and Testing Set Creation program, each
vector was modified and placed into the correct format. Firstly, this involved removing the mark-up
around the vector (i.e. the year and the '[' ']' brackets). Another proviso of the ARFF file format is that each
feature needs to be explicitly declared at the start of the file. When this process was completed, the
program output 8 training ARFF files for each training-set size (2, 4, ..., 20) and 8 test ARFF files.
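
As an illustration of this conversion (the attribute names here are my own, not those used in the project), the sketch below writes an ARFF header that declares each feature before the comma-separated instances, with the binary outcome declared as a nominal class:

// Illustrative sketch: writing one training set to an ARFF file.
import java.io.PrintWriter;

public class ArffWriterSketch {
    public static void write(java.util.List<String> vectors, String fileName) throws Exception {
        PrintWriter out = new PrintWriter(fileName);
        out.println("@relation nfl-matches");
        out.println("@attribute homeWinRatioYear1 numeric");
        out.println("@attribute awayWinRatioYear1 numeric");
        // ... one @attribute line per remaining feature ...
        out.println("@attribute result {0,1}");   // nominal class: 1 = home win, 0 = away win
        out.println("@data");
        for (String vector : vectors) {
            out.println(vector);                  // e.g. "0.563,0.438,...,1" (year and brackets already stripped)
        }
        out.close();
    }
}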

4.4.2.4 Data Analysis Using WEKA

As mentioned previously, WEKA offers a Java library which enables the use of its classes and functions.
Therefore, it would be more efficient to run the training and testing through a Java program rather than
through the conventional Explorer GUI. Thus, I created a Java class named WekaWrapper where an

Table 4.5: During Optimisation of Prototype 3, Example of How The Model Was Tested When Trained
with 20 Years of Match Data

Iteration Training Years Test Year


1 1979-1998 1999
2 1980-1999 2000
3 1981-2000 2001
.. .. ..
8 1986-2005 2006

instance of the class could be created for each prototype. A method then took each of the 8 training files
and built 8 separate logistic classifiers (using the Logistic class provided). These classifiers were then
evaluated on their corresponding test files. This process was carried out for each set of differing training
years (2, 4, 6, etc.). The results can be seen within Table 4.6.

Table 4.6: Accuracy of Different Training Years in Prototype 3

No. of Training Years Accuracy(%)


2 59.8
4 61.3
6 61.5
8 60.8
10 60.9
12 61.2
14 60.6
16 60.7
18 60.9
20 61.8
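
The training and evaluation step itself can be expressed compactly through WEKA's Java API; the sketch below is illustrative rather than the exact WekaWrapper code (error handling omitted):

// Illustrative use of the WEKA Java API to train and test one logistic model.
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;

public class WekaSketch {
    public static double evaluate(String trainArff, String testArff) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader(trainArff)));
        Instances test = new Instances(new BufferedReader(new FileReader(testArff)));
        train.setClassIndex(train.numAttributes() - 1);   // the result is the last attribute
        test.setClassIndex(test.numAttributes() - 1);

        Logistic model = new Logistic();                  // WEKA's ridge-regularised logistic regression
        model.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        return eval.pctCorrect();                         // percentage of test matches predicted correctly
    }
}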

4.4.3 Evaluation

As we can see from Table 4.6, Prototype 3, when trained on 20 years of data, achieved the highest
accuracy of 61.8%. This eased an initial worry about using a large amount of training data, namely that
trends within the NFL may have changed over the years (e.g. home advantage may have become more or less
significant). Although using 4 or 6 years worth of training data attained similar success rates, I decided
to use 20 years worth of match data henceforth. This was due to the fact that I had access to enough
data to sufficiently test matches using 20 years of training data.
The accuracy of the prototype (trained on 20 years) can now be compared to the accuracies found
within the first two prototypes. The first two prototypes were originally tested on all data, but to compare
them accurately with Prototype 3, they were re-assessed using only matches from 1999 onwards.
This led to revised accuracies of 57% and 57.2% for Prototypes 1 and 2 respectively. This suggests that
the Goddard and Asimakopoulos model can be used to help predict the outcomes of American Football
games with a higher accuracy than simply picking the home team or using last year's result. Prototype
3's differences in accuracy from the first two prototypes were found to be highly significant, both at the
0.001 level (far past the designated 0.05 threshold).

4.4.4 Prototype Summary

Referring back to the prototype approach recommended by Hughes and Cotterell [33]:

• I hoped to learn whether a complex logistic model could outperform simple prediction baselines.
• This was to be evaluated by the accuracy tested on games between 1999-2006 (using the optimum
number of training years) and comparing with Prototypes 1 and 2.
• I showed that a logistic model of novel numerical features achieved predictions significantly superior
to relying on the home team or the previous result between the competing teams.

4.5 Prototype 4 (Inclusion of Ranking Features)


After assessing the first 3 prototypes, it was clear that the model within Prototype 3 held high predictive
qualities. Therefore, the decision was taken to build upon Prototype 3 to see if additional features could
improve the accuracy of the model. As mentioned within the Background Reading chapter, power
scores/ratings/rankings can be used to make accurate predictions of NFL games. This was mainly shown
within [18], which found that they competed well with NFL betting spreads. Furthermore, Harville
claimed that the basis for successful sporting forecasting is some type of rating system [31]. With
this in mind, I decided to search for NFL rankings.

4.5.0.1 Jeff Sagarin’s Power Ratings

As previously mentioned in Section 3.1.4, initial research found power ratings calculated by Jeff Sagarin
[16]. Sagarin is a well-respected statistician who also produces similar statistics for other North American
sports such as ice hockey, basketball, etc. He currently works for USA Today, providing these power
rankings every week, and has done so for some time [18]. His ratings are held in such regard that they
are used to help decide the final league positions within American college football [18]. I was
not able to find out how these ratings are calculated, but they are widely used and therefore must give
some indication of the strengths/weaknesses of a team, and thus could be used to achieve a reasonable forecast of
a match outcome. A section of the website containing ratings for the 1999 NFL season can be seen in
Figure 4.3.

4.5.0.2 Football Outsiders

Although Sagarin's ratings had promise in terms of their forecasting ability, I decided to get a second
set to see if these helped, or even held better prediction qualities than Sagarin's. Hence, I came across
Football Outsiders, a website based on analysing American Football through numerical statistics
[5]. They recently entered into a partnership with the huge American sports broadcast network ESPN
[6], which signifies that they (like Sagarin's ratings) are highly thought of. Unlike Sagarin, Football
Outsiders give a fairly detailed description of how they attain their figures.
Their ratings centre around a figure called DVOA (Defense-Adjusted Value Over Average). This
breaks down each play within the NFL season to see how much success each player achieves compared
to the league average. They claim their statistics are better than the official NFL statistics8 as they take
into account the importance of each play, unlike the official records [7]. Imagine that a team is at
third down and 4 yards away from making the next down. If the quarterback makes that 4-yard pass it
has much more significance than if it were a 4-yard pass on first down. Football Outsiders take note of
this importance whereas the official NFL statistics treat both as simply 'a 4-yard pass'. DVOA is
thought of as a team efficiency statistic, hence it can be used in a similar vein to power scores/ratings9.

4.5.1 Design

Both ratings from Sagarin and Football Outsiders (now referred to as FO) needed to be incorporated
within each match vector for the home and away team. The first step was to acquire the ratings from the
respective websites. As previously mentioned, Sagarin’s stats are archived from 1998 onwards whereas
FO held data reaching back to 1995. Now, in the case of both of these ratings, only the final statistics of

8 The NFL keeps records of statistics throughout the season, such as how many yards each quarterback has completed
with his passing


9 To find out in more detail how DVOA is calculated, see http://www.footballoutsiders.com/info/methods#dvoa

each year were available (i.e. the ratings after the end of the final week of each season), whereas these
statistics are supposed to be used (and updated) on a weekly basis. Thus, when placing these ratings
in the vector of a match played in 2003, for example, I used the team's final rating from 2002 to
help predict the outcome. I felt this should not be a problem because if a team finishes one season
strongly then that momentum can be carried over to the next season. Furthermore, in soccer (specifically
the English Premier League) the same teams are generally found to be successful, shown by the
same teams usually finishing in the top 8 every year. Thus, I based this prototype on the theory that if an
NFL team was successful in one year it would more than likely be successful in the following year.

Figure 4.3: Jeff Sagarin Ratings for the 1999 NFL season

4.5.2 Implementation

I created two 'web spiders', one for each website, which would automatically obtain the ratings for each team
within a certain season. Similar to the data collection spider (found in Section 4.1), these programs were
specific to the HTML found within each website. Both programs parsed this HTML to get the
relevant ratings from the different years, where each year was located on a different web page within
both sites. An example of the output from the Sagarin spider can be seen in Figure 4.4.
These ratings were stored in two separate text files, meaning they could simply be searched
to find the rating for a team within a certain year. This functionality was added in the form of two
new methods within the Match class (one for each rating). So, for each of the competing teams within
a match, the program simply called the two methods to extract the Sagarin and FO ratings for the previous year.
Figure 4.4: Text Output From The Sagarin Rating Collector Program. Displaying The Year, Ranking,
Team Name and Sagarin Rating

These figures were then added to the match vector along with Prototype 3's features.
As I had data from 1998 onwards for Sagarin’s stats and 1995 onwards for FO, I was forced to use a
smaller test space than the one used in Prototype 3. This meant that the prototype (still being trained on
the previous 20 years of data) was tested on matches between 2003-2006. This gave the rating features
enough time to be assigned appropriate weightings during logistic training.
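
A sketch of such a lookup method is shown below (illustrative only; the line format is an assumption based on Figure 4.4, shown comma-separated for simplicity):

// Sketch of looking up a team's end-of-season rating for a given year from a ratings text file.
import java.io.BufferedReader;
import java.io.FileReader;

public class RatingLookup {
    public static double getRating(String ratingsFile, String team, int year) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(ratingsFile));
        String line;
        while ((line = in.readLine()) != null) {
            // assumed line format: year,ranking,teamName,rating
            String[] parts = line.split(",");
            if (Integer.parseInt(parts[0]) == year && parts[2].equals(team)) {
                in.close();
                return Double.parseDouble(parts[3]);
            }
        }
        in.close();
        return Double.NaN;   // not found: the vector creator falls back on WEKA's missing value ('?')
    }
}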

Table 4.7: Prototypes Tested On 924 Matches between 2003-2006

Prototype Accuracy(%)
3 (Goddard) 62.0
4 (Goddard with Rankings) 61.2
4.1 (Just Rankings) 58.2
4.2 (Just Sagarin) 58.1
4.3 (Just Football Outsiders) 58.0

4.5.3 Evaluation

The non-specific nature of the WekaWrapper class I created allowed me to simply 'plug in' the
ARFF files output by Prototype 4's training and test set creator to assess its accuracy. However, as
differing test years were used here, I had to re-evaluate Prototype 3 using the last four years of data in
order to compare the two accurately. The results from these tests can be seen in Table 4.7.
This shows that the ratings introduced into the Goddard and Asimakopoulos model had
a detrimental effect on the prototype's forecasting ability, although these accuracies were not found to
be significantly different. The first thing to assess was whether this lack of improvement was due to
Prototype 4 being trained on a large amount of data that held no rating information. For example, a
training set encompassing 1984-2003 would only have FO training data within match vectors between
1996-2003 and Sagarin's between 1999-2003. Match vectors between 1984-1995 would have 'missing'
(mean) values where the ratings are stored. This could have an adverse effect on the training of the
prototype.
Consequently, I tested Prototypes 3 and 4 on the same years as above but using the previous 4 rather
than 20 years of data for training. The choice of 4 years was based on the fact that this was found to be one of
the more accurate year-sets during the optimisation of Prototype 3 (Table 4.6). This would eliminate
the vast number of mean values used within the training of Prototype 4. However, this led to both
prototypes becoming less accurate than when trained on 20 years, and Prototype 3 still being a better
forecaster than Prototype 4.
Further investigation was required, so I ran a model with just the ratings on their own (without the
features from the Goddard model). Using 20 years of training data, this was tested on the same 4
years as above (2003-2006). As shown in Table 4.7, this prototype (4.1) achieved an accuracy of 58.2%
and was found to be significantly different to both Prototype 3 (at the 0.2 level) and Prototype 4 (at the
0.10 level). Furthermore, although this ratings model was found to be more accurate than Prototypes 1
and 2 on these test years, the differences were not significant. This suggests that the ratings
added to Prototype 3 are a poor indicator of which team will be victorious over another.
I decided to carry out additional analysis to see if one set of ratings outperformed the other, with a
view to removing one set. Two more temporary prototypes, 4.2 and 4.3, were then created using just
Sagarin's ratings and just the FO stats respectively. I established that Prototype 4.2 achieved an accuracy
of 58.1% and Prototype 4.3 attained 58.0% (see Table 4.7). Although Prototypes 4.1 and 4.2 were not
found to be statistically different from each other, we can probably hypothesise that the FO team efficiency
ratings are no better than Sagarin's ratings, as FO have 3 more years worth of data within the training
of the models and still achieve a similar accuracy. However, as we can see from this, there is minimal
difference between using the ratings individually and using them together.

I came to the conclusion that these rankings (as mentioned before) are supposed to be used on a
weekly basis to predict the winner of each match within that week. The ratings are updated after said match
and then used to predict the next game, and so on. I only had data representing each team's strength
at the end of each season, and clearly this is not a good indicator of how well a team will perform in the
following season.
A reason for this was highlighted by Koning's analysis of competition within sports. He claims that
the NFL draft system is very important in keeping a competitive balance between teams within American
Football [38]. The draft is the process of teams picking the best up-and-coming college footballers in
the off-season, with the worst teams from the previous season getting first pick and the best
teams picking last [11]. This affects Prototype 4 because a team that is successful
in one season has no guarantee of success in the next, due to this draft system along with
other factors such as teams getting a new coach, teams being unable to maintain last season's form, etc.
This can be shown by comparing the number of different winners of the NFL's Super Bowl with that of the
English Premier League (EPL): the number of different champions since 1995 in American Football is
11 [13], whereas there have only been 3 within the EPL during that time.

4.5.4 Prototype Summary

Referring back to the prototype approach recommended by Hughes and Cotterell [33]:

• I hoped to learn whether adding power ratings to Prototype 3 could improve its accuracy.
• This was to be evaluated by the accuracy tested on games between 2003-2006 and comparing to
Prototype 3’s accuracy within these years.
• I showed that the power ratings did not improve Prototype 3 as they must be used on a weekly
basis to have predictive qualities.

4.6 Prototype 5
I concluded that as rating data was not available for each week within the test years, the next prototype
should not incorporate the rating features. Therefore my next task was to find other features that would
improve the accuracy of the model that was found within Prototype 3.
A lot of the research that I covered within Section 3.3.1.1 made use of the score differences involved
in recent games rather than just the results [22, 41, 31, 44]. Goddard and Asimakopoulos surmised
that using only the results made the model simpler and more suited to soccer, as most victories are by a
margin of only one or two goals anyway [27]. However, NFL games can have a score difference of anything
between 0 and 40+.
Subsequently, I decided to replace the recent results (win/lose/draw) from Prototype 3 with the actual
difference in score within those matches. The magnitude of victory/loss for a team is more informative
than simply whether they won or lost. Imagine a scenario where two teams have won their past 5 games
where team A won each game 30-0 and team B won their 5 matches 5-0. Although the current model
would not see a difference in these two sets of recent games, a sensible prediction would be to choose
team A to win. This is because although both are in good form, team A looks to have won their matches
with greater ease (suggesting they are the stronger team).

4.6.1 Implementation

This meant I needed to add another method to the Match class. This method was similar to the one which
calculated the recent results, except this time it was required to store the magnitude of victory/loss for
that team (where 0 indicated a draw). Also as stated, the ratings were removed from the match vectors
which was done whilst sorting said vectors into training and test sets.
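
A sketch of this new method is given below (illustrative names only, not the actual Match class code); it also shows the optional cap of 25 on score differences that is discussed in the evaluation that follows:

// Sketch of recording recent score differences rather than win/draw/loss results.
public static double[] recentScoreDifferences(int[] pointsFor, int[] pointsAgainst,
                                              int games, boolean capAt25) {
    double[] diffs = new double[games];
    for (int i = 0; i < games; i++) {
        int diff = pointsFor[i] - pointsAgainst[i];   // positive = win margin, negative = loss margin, 0 = draw
        if (capAt25) {
            // optional restriction on 'running up' the score (see Section 4.6.2)
            if (diff > 25) diff = 25;
            if (diff < -25) diff = -25;
        }
        diffs[i] = diff;
    }
    return diffs;
}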

4.6.2 Evaluation

The removal of the ratings enabled me to expand the test data back to 8 years (1999-2006). Therefore, the
structure by which this prototype was trained and tested is the same as that found in Table 4.5. During the
testing of this prototype, an accuracy of 62.5% was averaged over the 8 test years. This improved on the
accuracy of Prototype 3; however, the difference was not found to be significant.
In a final attempt to improve the model, I placed a restriction on the magnitude of victory within the 'recent game' features. This is in reference to work found within Chapter 3 by various authors [30, 31, 41]. By not allowing teams to 'run up' the score, these authors state that more accurate data can be extracted from previous games, as points/goals scored past a certain point are not informative. After looking through the various score differences within Prototype 5's match vectors, I looked for a sensible threshold which would bound around 10% of the score differences. This led me to convert all recent game score differences which were over 25 to 25. After re-testing, I unfortunately failed to find a significant difference in performance between this restricted version and Prototype 5. Thus, although restricting the score differences has been considered to make them more informative, I was unable to prove this and did not consider it henceforth. In conclusion, although Prototype 5 was not found to be significantly better than Prototype 3, I feel that recording the differences in scores rather than just the results is more suited to American Football.
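
A minimal sketch of this thresholding step is given below. The cap of 25 is taken from the text above; treating heavy defeats symmetrically (capping at -25) is my assumption, as the report only states that values over 25 were converted to 25.

/**
 * Caps a recent-game score difference so that 'run-up' scores are not
 * treated as extra information. Clamping heavy losses at -threshold is
 * an assumption rather than something stated explicitly in the report.
 */
static int capScoreDifference(int scoreDifference, int threshold) {
    if (scoreDifference > threshold) {
        return threshold;
    }
    if (scoreDifference < -threshold) {
        return -threshold;
    }
    return scoreDifference;
}

With a threshold of 25, a 35-3 win (difference 32) would be recorded as 25, while a 21-17 win (difference 4) would be left unchanged.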

4.6.3 Prototype Summary

Referring back to the prototype approach recommended by Hughes and Cotterell [33]:

• I hoped to discover whether the score difference in recent matches is more informative than just the result.
• This was to be evaluated by testing on games between 1999 and 2006 and comparing with Prototype 3.
• I could not prove this to be the case but decided this was a more suitable model when dealing with NFL predictions.

4.7 Evaluation Against Betting Market


I have already mentioned within my background research that many authors compared their models' predictions to those of the bookmakers. As shown above, differing sets of training and test data can produce differing accuracies, e.g. some test years achieve better forecasts than others. Therefore, testing against predictions made by bookmakers gives an idea of how accurate Prototype 5 (without the score difference restriction) actually is. Within the data obtained from Prof. Gray (see Section 2.5.2), each match had a spread attached which held the prediction associated with that match.
As the prototype needed 20 years to train and the betting data only went up to the end of the 1994 season, only games between 1990 and 1994 could be tested. Thus, these 5 years were tested using the previous 20 years as training data. Table 4.8 shows that the betting line is again the superior predictor when it comes to sporting predictions. Having said that, the results here are not statistically significant at the pre-determined 0.05 level (the difference was significant only at the 0.10 level). In other words, even though the betting line was a slightly better forecaster, it cannot be rejected that this higher performance occurred through chance.

Table 4.8: Prototype 5 & Betting Line Tested On 995 Matches between 1990-1994

Forecaster Accuracy(%)
Prototype 5 65.2
Betting Line 67.4
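
For reference, the significance check behind this comparison can be illustrated with a McNemar-style calculation over the two forecasters' disagreements. The sketch below is a generic implementation of the continuity-corrected McNemar statistic; the counts used in the example are hypothetical and are not the project's actual figures.

/**
 * McNemar test statistic (with continuity correction) for comparing two
 * forecasters on the same set of matches. Only the discordant matches matter:
 * b = matches predicted correctly by forecaster A but not by B,
 * c = matches predicted correctly by B but not by A.
 * The statistic is compared against the chi-squared distribution with one
 * degree of freedom (critical values of roughly 3.84 at the 0.05 level and
 * 2.71 at the 0.10 level).
 */
static double mcNemarStatistic(int b, int c) {
    double numerator = Math.abs(b - c) - 1.0; // continuity correction
    return (numerator * numerator) / (b + c);
}

For example, with hypothetical counts b = 60 and c = 82, the statistic is (|60 - 82| - 1)^2 / 142, which is approximately 3.1: large enough to exceed the 0.10 critical value but not the 0.05 one, matching the pattern of significance reported above.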

Chapter 5

Evaluation

5.1 Quantitative Evaluation


5.1.1 Overall Prototype Evaluation

Features which were proposed to predict the winner of a soccer game [27] were the cornerstone of this project's work in predicting the winner of American Football games. When placed into a logistic regression model, these features gave substantially more accurate predictions than simply choosing the home team or relying on previous results between the teams. Subsequent prototypes were extensions and modifications of this model. However, the analysis of these prototypes showed that the main predictive qualities came from the features suggested within [27].
The final prototype, which included the score difference of a team's recent games, attained an accuracy of 65.2% on matches played between 1990 and 1994. This was beaten by the bookmakers' predictions, although the difference was not found to be significant. This prototype's accuracy was close to that achieved by Stefani's least squares model, which achieved 68.4% but was itself bettered by the bookmakers' prediction accuracy of 71% on the same matches. When using a much more complex linear model to forecast NFL games, Harville produced a superior performance to Prototype 5 by correctly forecasting 70.3% of the NFL matches within his test set [31]. However, this again was beaten by the 72.1% achieved by the betting line. In conclusion, although the accuracies of two alternative numerical models were found to be higher than this project's prediction success rate, they were beaten by their respective betting lines, whereas this project's model was found to be no worse than the predictions of the bookmakers.
My model can also be seen to be more accurate than American Football experts' predictions. Within [43], the authors recorded the average expert prediction accuracy for NFL winners as 62.2%. Furthermore, Boulier and Stekler found the New York Times editor to have an even poorer accuracy of 59.7% [18]. The fact that my numerical system does not hold idiosyncratic opinions, as experts do, should aid the model in making more accurate, unbiased forecasts.

5.1.2 Usefulness of Features

5.1.2.1 Feature Ablation

I investigated the usefulness of each feature within the final prototype, which highlighted areas of the model that could extend this report's work. This investigation was done through feature ablation, which can be executed in two different ways. If we assume the number of features is f, then the first approach is to test the prototype f times, each time taking out a single feature. This is done for each feature and determines how the model fares without that attribute. The alternative method within ablation studies is to run the prototype f times using only one feature at a time (again done for each feature). I carried out both approaches; however, the latter technique produced few conclusions because each feature is uninformative on its own. I will therefore only discuss results using the first feature ablation approach.
I carried out this process for Prototype 5 (without the score difference restriction) on matches between 1999 and 2006 using the 8 iterations of training seen in Table 4.5. For each iteration, this entailed training and testing the model f times, each time removing a different feature, giving f average accuracies, one for each reduced version of the model. These accuracies were then compared with Prototype 5's accuracy of 62.5% found in Section 4.6.2.
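
The leave-one-feature-out loop can be implemented directly with WEKA's Remove filter (the class mentioned in Section 5.2.5). The sketch below is a simplified illustration rather than the project's actual code: the ARFF file names are placeholders standing in for one of the project's training/test splits.

import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FeatureAblation {
    public static void main(String[] args) throws Exception {
        // Placeholder ARFF files standing in for one training/test iteration.
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Remove one feature at a time (never the class attribute) and re-test.
        for (int i = 0; i < train.numAttributes() - 1; i++) {
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(i + 1)); // WEKA indices are 1-based
            remove.setInputFormat(train);
            Instances reducedTrain = Filter.useFilter(train, remove);
            Instances reducedTest = Filter.useFilter(test, remove);

            Logistic logistic = new Logistic();
            logistic.buildClassifier(reducedTrain);

            Evaluation eval = new Evaluation(reducedTrain);
            eval.evaluateModel(logistic, reducedTest);
            System.out.printf("Without %s: %.3f%%%n",
                    train.attribute(i).name(), eval.pctCorrect());
        }
    }
}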
A damaging feature is one which, when taken out, leaves the model more accurate than the original. One strategy is to find all the damaging features, remove them from the model and re-assess its accuracy. Although no features were found to be damaging, the results from this process can be viewed in Table D.1 (a key for the features is given in Table D.3). The features whose removal produced the lowest accuracies, such as the score difference in the home team's 5th recent home game (61.036% when removed), are the most valuable to the overall model. Conversely, the features displayed at the bottom of Table D.1 leave the model with an accuracy closer to that of the original, suggesting they are the more redundant attributes (e.g. the score difference in the home team's 3rd recent home game). However, it should be noted that only 3 of these accuracies were found to be statistically significantly different from that of Prototype 5.
Here we can confirm one theory from within the project: that last year's result between the teams is a useful feature in predicting an NFL match (Section 4.3). This is shown by the accuracy decreasing (by admittedly a small amount) when that feature is removed, with this decrease being statistically significant.
These results also suggest that the amount of recent home game data could be bounded in future prototypes. This is because the 6th recent home game for the away team and the 7th recent home game for the home team are among the more 'damaging' features. This suggests that if the number were bound to perhaps 4 or 5 games instead of Goddard's original proposal of 9, the accuracy of the model might improve. This could be implemented in a future prototype; however, if this were done, further study would be needed to find the optimal number of games to use. Overall though, the differences in accuracies are fairly minimal, indicating that perhaps the model contains neither outstanding nor redundant features, so I investigated further.

5.1.2.2 Ranking Of Features

WEKA provides another way of assessing how effective each of the model's features is: the Attribute Selection option. This was carried out using the Ranker search class in conjunction with the InfoGainAttributeEval evaluator. This evaluates the worth of a feature by the information gained with regard to the result of the match; the features are then ranked in this order. The features within the training data (1979-2005) for Prototype 5 were ranked by this process and can be seen in Table D.2.
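
A condensed sketch of how such a ranking can be produced through WEKA's attribute selection API is shown below. The ARFF file name is a placeholder; this illustrates the Ranker/InfoGainAttributeEval combination described above rather than reproducing the project's actual script.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureRanking {
    public static void main(String[] args) throws Exception {
        // Placeholder for the 1979-2005 training data used by Prototype 5.
        Instances data = DataSource.read("prototype5-train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval()); // worth = information gain w.r.t. the match result
        selection.setSearch(new Ranker());                   // rank every feature rather than selecting a subset
        selection.SelectAttributes(data);

        // Each row holds {attribute index, score}, best first.
        for (double[] ranked : selection.rankedAttributes()) {
            System.out.printf("%-25s %.5f%n",
                    data.attribute((int) ranked[0]).name(), ranked[1]);
        }
    }
}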
Here, we see that the win ratios of the home and away team are important features within the model, with the previous year's ratio (e.g. homewinratio 1) being more important than the ratio for the year before that (homewinratio 2). Furthermore, it can be seen that recent games are more informative when they are nearer to the current match (e.g. homerecentawayscdif1); it is only as we move down the table that the nth recent games increase (e.g. homerecenthomescdif9). This backs up the suggestion made in the previous subsection that the recent home game features could be bound to a lower number.
This method slightly contradicts what was found within the previous subsection, as last year's result is shown not to be a very valuable feature. Although its informative quantity was found to be statistically significant, it was discovered to be one of the least valuable features in this analysis. Thus, the variable's value to the model is somewhat inconclusive.
Overall, I have shown that the win ratios are very important when deciding the outcome of a match, with the home ratio being a larger factor than the away team's equivalent. Perhaps the number of previous win ratios could be extended, in conjunction with a bounding of recent home games, within a future prototype. The features regarding the stadium capacities and the travelling distance were suggested to be worthless within the feature rankings; further analysis would be required to confirm this (I also analysed the weight coefficients obtained by the logistic training, however the results were peculiar and thus inconclusive).

5.2 Qualitative Evaluation
5.2.1 Project Evaluation

Originally the project was to utilise Natural Language Processing to extract expert predictions with a view to aiding the forecasts produced by the numerical model. The required data was not found and thus this aspect of the project had to be abandoned. I could have created a web crawler which trawls the Internet in order to find this data; however, this would have taken time to develop and weeks, maybe even months, to execute and find the relevant text. Having said this, as discussed in Section 5.1.1, these expert opinions might not have added any predictive quality to the project's current model.
In relation to the soccer model proposed by Goddard and Asimakopoulos, one option could have been to compare the findings in [27] with those of my model. This would have analysed the differences in predictability between soccer and American Football. However, the main priority of this project was to analyse the predictive quality of the model in the domain of American Football, and this was done through the comparison with the NFL betting market. If the project were extended, this comparison could be carried out to compare the two sports (noting that the model in [27] used a different evaluation system from the one seen in this project).

5.2.2 Objective and Minimum Requirements Evaluation

Referring back to the original objectives found within Section 1.3, we see that I have carried out each of these with a certain degree of success. I have shown understanding of which information can be used to predict the outcome of an American Football match; this was seen within the features that were chosen to create Prototype 3. I looked into how different techniques can be used to model a match within my Background Reading chapter and decided to use a regression model to utilise Prototype 3's features (thus creating a model used to predict a match). Finally, by comparing this model to the bookmakers' predictions I have analysed how successful the chosen approach was.
This report shows how I fulfilled all of my minimum requirements (Section 1.4). I developed and implemented an existing sports prediction algorithm [27] within Prototype 3, which was then enhanced by the two following prototypes. Furthermore, Section 5.1.2 details feature ablation studies highlighting the most useful features within the model. Lastly, I gave prediction accuracies of all prototypes throughout the project, showing critical analysis of each algorithm.

5.2.3 Project Extensions

Outside of the minimum requirements detailed within Section 1.4, I also created baseline algorithms (Section 4.2 & Section 4.3) against which the more complex prototypes could be compared. Moreover, I compared predictions made by the betting market with those of my most accurate prototype in order to assess it definitively.

5.2.4 Schedule Evaluation

Clearly, the alteration to the project affected its performance. The need to revise the original schedule to such a degree meant I had less time to develop the numerical model. During the schedule revision, I underestimated the amount of reading needed for the numerical analysis. The 'Background reading - numerical algorithms' task was carried out up to around the end of February. This left only 2 or 3 weeks to build further prototypes on top of the first three, around 2 weeks less than scheduled.
Although this hindered me, the key point in the original/revised schedule was starting the design and implementation of Prototype 3 during the initial stages of background reading. If this had been started any later, it would have further decreased the amount of work carried out on further prototypes.

5.2.5 Methodology Evaluation

The prototype approach was vital in allowing me to add and remove different features from the Goddard & Asimakopoulos base. This can be seen specifically within the testing of the ranking features (Section 4.5.3). Here, I needed to quickly test whether the ranking attributes were useful on their own. Prototyping allowed me to do this without affecting the project's structure.
Technologies such as Java and WEKA allowed me to implement this project efficiently, without major setbacks. The HashMap class within Java's library enabled me to create the results matrix (Section 4.3.2), which aided in result exploration. Furthermore, the object-oriented architecture helped me to represent each match and use the methods stored within the Match class to create the vectors. Although Python may have been more efficient in the parsing of text files and web-spidering, the standard Java classes allowed me to carry out these processes with just as much ease and flexibility.
The choice of WEKA was justified throughout the project. The Java compatibility enabled me to quickly build classifiers using the training data and test them. This structure allowed me to efficiently 'plug in' the corresponding ARFF files to assess each prototype. Furthermore, the Remove class within WEKA was used to quickly remove the relevant features 'on the fly' from an ARFF file rather than re-creating the file itself. This was a great help during the feature ablation studies.

Chapter 6

Conclusion

6.1 Conclusion
The problem set by this project was to see whether numerical data could be used to accurately predict the outcome of matches within American Football. This was approached using a logistic regression model incorporating information based on previous results, e.g. the recent form of the competing teams. The model also included novel features such as the distance between the two teams and the two teams' stadium capacities. This model was seen to hold more predictive quality than simply choosing the home team or relying on previous results between the two teams.
The model was modified to include the score difference within both teams' recent games and, although this was not found to significantly improve the accuracy of the original model, it produced a more informative system for predicting an NFL game. Ultimately, this modified regression model was seen to compete with predictions made by the betting line, where the system's forecasts were found to be no worse than the bookmakers'.

6.2 Further Work


• By assessing the probability produced by the logistic model for each match (a minimal sketch for obtaining these probabilities is given after this list):
– One could analyse whether these probabilities could be used to improve the forecasts.
– Or whether these probabilities match the size of the betting line spread (i.e. large probabilities map to large betting spreads).

• A different regression technique could be used to evaluate the project's model. This could either be an attempt to improve the match winner predictions using another binary-output technique (e.g. Maximum Entropy), or a model which produces a continuous result representing the predicted spread of a match, which could then be compared to the actual margin of victory. One such approach could be Ordinary Least Squares, as used by Golec and Tamarkin [28].
• If a web crawler program were created and professional expert opinions could be found, then this subjective data could be utilised in order to obtain a superior prediction for an NFL match.
• The numerical model could be applied to another professional sport, e.g. ice hockey. This would be justified by work carried out by Stefani [44], in which he used his model on a number of differing sports and attained consistent accuracies.
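
Regarding the first item, WEKA's classifiers already expose per-class probabilities, so the starting point could be as small as the sketch below. The trained model and test instance are assumed to exist; this is an illustration rather than part of the project's implementation.

import weka.classifiers.functions.Logistic;
import weka.core.Instance;

public class ForecastProbability {
    /**
     * Returns the model's estimated probability for each class value of the
     * given match instance (the ordering follows the class attribute's
     * declared values). The Logistic model is assumed to have already been
     * trained on compatible data.
     */
    static double[] outcomeProbabilities(Logistic model, Instance match) throws Exception {
        return model.distributionForInstance(match);
    }
}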

Bibliography

[1] Chronology of home stadiums for current national football league teams.
http://en.wikipedia.org/wiki/Chronology_of_home_stadiums_for_current_National_Football_League_teams.

[2] City distance tool. http://www.geobytes.com/CityDistanceTool.htm?loadpage.

[3] Digital tv from sky. http://www.sky.com/portal/site/skycom/skyproducts/skytv.

[4] Directions.pdf. http://www.nflfootballstadiums.com/Directions.pdf.

[5] Football outsiders: Football analysis and nfl stats for the moneyball era.
http://www.footballoutsiders.com/stats/teameff.

[6] Football outsiders: Football analysis and nfl stats for the moneyball era.
http://www.footballoutsiders.com/info/FAQ.

[7] Football outsiders: Football analysis and nfl stats for the moneyball era.
http://www.footballoutsiders.com/info/methods#dvoa.

[8] Ladbrokes profits jump on punter's losing streak.
http://www.iht.com/articles/reuters/2008/02/28/business/OUKBS-UK-LADBROKES.php.

[9] Learn about java technology. http://www.java.com/en/about/.

[10] Nfl’s beginner’s guide to football. http://www.nfl.com/rulebook/beginnersguidetofootball.

[11] Nfluk.com - about the game - rookie faqs.
http://www.nfluk.com/about-the-game/rookies_faqs.html#pagetop.

[12] Python programming language – official website. http://www.python.org/.

[13] Super bowl 43 super bowl history. http://www.nfl.com/superbowl/history.

[14] Super bowl xlii tackles record 97.5 million viewers. http://www.multichannel.com/article/CA6528715.html.

[15] This day in history 1984: Baltimore colts move to indianapolis.
http://www.history.com/this-day-in-history.do?action=Article&id=56982.

[16] Usatoday.com. http://www.usatoday.com/sports/sagarin-archive.htm#nfl.

[17] T. Baker. Building a system to recognise predictive opinion in online forum posts. Final Year
Project, University of Leeds, 2007.

[18] B.L. Boulier and H.O. Stekler. Predicting the outcomes of national football league games. Inter-
national Journal of Forecasting, 19(2):257–270, 2003.

[19] S. Le Cessie and J.C. Van Houwelingen. Ridge estimators in logistic regression. Applied Statistics,
pages 191–201, 1992.

[20] S. Chatterjee and A.S. Hadi. Regression analysis by example. Wiley-Interscience, 2006.

[21] T.G. Dietterich. Approximate statistical tests for comparing supervised classification learning
algorithms. Neural Computation, 10:1895–1923, 1998.

[22] M.J. Dixon and S.G. Coles. Modelling association football scores and inefficiencies in the football
betting market. Applied Statistics, 46(2):265–280, 1997.

[23] M.S. Duh, A.M. Walker, M. Pagano, and K. Kronlund. Prediction and cross-validation of neural
networks versus logistic regression: using hepatic disorders as an example. American journal of
epidemiology, 147(4):407–413, 1998.

[24] A.J. Dwyer. Matchmaking and mcnemar in the comparison of diagnostic modalities. Radiology,
178(2):328, 1991.

[25] D. Forrest, J. Goddard, and R. Simmons. Odds-setters as forecasters: The case of english football.
International Journal of Forecasting, 21(3):551–564, 2005.

[26] D. Forrest and R. Simmons. Forecasting sport: the behaviour and performance of football tipsters.
International Journal of Forecasting, 16(3):317–331, 2000.

[27] J. Goddard and I. Asimakopoulos. Forecasting football results and the efficiency of fixed-odds
betting. Journal of Forecasting, 23(1):51–66, 2004.

[28] J. Golec and M. Tamarkin. The degree of inefficiency in the football betting market : Statistical
tests. Journal of Financial Economics, 30(2):311–323, December 1991.

[29] P.K. Gray and S.F. Gray. Testing market efficiency: Evidence from the nfl sports betting market.
Journal of Finance, 52(4):1725–1737, September 1997.

[30] D. Harville. The use of linear-model methodology to rate high school or college football teams.
Journal of the American Statistical Association, 72(358):278–289, 1977.

[31] D. Harville. Predictions for national football league games via linear-model methodology. Journal
of the American Statistical Association, 75(371):516–524, 1980.

[32] F. Hu and J.V. Zidek. Forecasting nba basketball playoff outcomes using the weighted likelihood.
Lecture Notes-Monograph Series, 45:385–395, 2004.

[33] R. Hughes and M. Cotterell. Software project management. McGraw-Hill Higher Education, 2006.

[34] D. Jurafsky, J.H. Martin, A. Kehler, K. Vander Linden, and N. Ward. Speech and language pro-
cessing: An introduction to natural language processing, computational linguistics, and speech
recognition. MIT Press, 2000.

[35] G.K. Kanji. 100 Statistical Tests. Sage Publications, London; Newbury Park, Calif., 1993.

[36] S.M. Kim and E. Hovy. Crystal: Analyzing predictive opinions on the web. In Proceedings of the
2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning (EMNLP-CoNLL), pages 1056–1064, 2007.

[37] D.G. Kleinbaum and M. Klein. Logistic Regression: A Self-Learning Text. Springer, 2002.

[38] R.H. Koning. Balance in competition in dutch soccer. The Statistician, 49(3):419–431, 2000.

[39] A. McKinlay. A system for predicting sports results from natural language. Final Year Project,
University of Leeds, 2007.

[40] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.

[41] H. Rue and O. Salvesen. Prediction and retrospective analysis of soccer matches in a league. The
Statistician, 49(3):399–418, 2000.

[42] J.P. Shaver. What statistical significance testing is, and what it is not. Journal of Experimental
Education, 61:293–316, 1992.

[43] C.U. Song, B.L. Boulier, and H.O. Stekler. The comparative accuracy of judgmental and model
forecasts of american football games. International Journal of Forecasting, 23(3):405–413, 2007.

[44] R.T. Stefani. Improved least squares football, basketball, and soccer predictions. Systems, Man
and Cybernetics, IEEE Transactions on, 10(2):116–123, February 1980.

[45] P. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification
of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics, pages 417–424, 2002.

[46] R.C. Vergin and J.J. Sosik. No place like home: an examination of the home field advantage in
gambling strategies in nfl football. Journal of Economics and Business, 51(1):21–31, 1999.

[47] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment
analysis. In Proceedings of the conference on Human Language Technology and Empirical Meth-
ods in Natural Language Processing, pages 347–354. Association for Computational Linguistics
Morristown, NJ, USA, 2005.

[48] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Mor-
gan Kaufmann, San Francisco, 2005.

[49] H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: Separating facts from opin-
ions and identifying the polarity of opinion sentences. In Proceedings of EMNLP-03, pages 129–
136, 2003.

Appendix A

Personal Reflection

The main thing for me was to choose a project which I would enjoy right the way through, and this was definitely the case here. This would be my first suggestion to anyone carrying out their final year project in future years: choose a project that you will enjoy. Otherwise, interest will be lost and the student will be unable to give their full devotion to the work.
In terms of data collection, clearly one alteration I would make would be to check whether the expert data was available before starting the textual background reading. Although I found this research interesting, it was ultimately pointless and ate into the project's time. If this research had not been carried out, I would have finished my numerical background reading sooner and would have had more time to spend on developing more prototypes. This is highlighted within the project's Further Work (Section 6.2); whilst writing it, I realised how many ways this project could have been extended and analysed with more time. Thus, two more pieces of advice from me would be that students embarking on a data collection-based project like this one should gather all the data before firmly deciding on a project, and that, once the minimum requirements have been decided, they should try to make time within the schedule for the project's potential extensions.
One key suggestion I would give to somebody would be to always attend the meetings with their supervisor and always have questions ready, writing down the responses during the meeting. Although this should not need to be suggested, if these meetings are recorded on paper then it is a good way of keeping account of what was mentioned during the various stages of the project. Some people suggested keeping a diary; however, I felt this was not needed within my project as I had recorded all the meetings through this process.
Whilst searching for the betting line data, I reached a dead-end similar to when I looked for the textual opinions. This led to e-mailing the authors of papers in which this data was mentioned. Writing to Prof. Philip K. Gray gave me access to the data I had been looking for, which was vital to the analysis of my project. I dealt with this correspondence in a polite and professional manner, which enabled me to get a quick response from him, and ultimately he was kind enough to send me the data. If I had been impolite or informal, he might not have even responded. Thus, another tip for future work would be to hold formal correspondence with any third parties within the project.
As mentioned within Section 2.3, the process of writing both reports (mid-project and final report) is a lengthy one. This may seem like an obvious statement to somebody who has not written a report of this size; however, the routine of writing a report is very much an iterative one. It involves writing initial content, checking through it, realising some areas are irrelevant and others are missing, and all of this is before one has considered proof reading the work. Therefore, it is my suggestion to any prospective third-year student to start this writing process during the implementation of other work so that enough time is allocated to complete the reports.
Moreover, with regards to the mid-project report, some of my peers did not pay enough attention to it. Personally, I felt it was a big opportunity to get vital feedback on the direction of the project at Christmas. It can be easy to think that because it does not count directly towards the overall project mark it is not worth spending much time on. However, the comments within my mid-project report advised me to be more detailed about how a regression approach is implemented, which I feel I have rectified within this final report. Additionally, I was told to make sure that it is explicitly stated where I created scripts and programs within the project rather than using another developer's code; hopefully, due to the mid-project assessment comments, I have dealt with this here. Lastly, relating to the mid-project report, if nothing else it gives the student the opportunity to get a lot of the final report writing out of the way. Although I have adjusted parts, I used the mid-project report as a basis for writing my final report.
In summary, one word I would use to describe this project is addictive. On countless occasions, I wanted to push a certain prototype further or utilise another technique to try to improve the accuracy. All in all, I feel I was fairly disciplined in this regard; however, I could easily see how somebody could become 'engulfed' by this urge to improve the model. Therefore, if someone were to carry out a similar research investigation to this one, they would need to define strict deadlines, as I did, and stick to them to ensure that all aspects of the project are fulfilled.

Appendix B

Project Schedule

Figure B.1: Original Project Schedule

Figure B.2: Revised Project Schedule

Appendix C

PREV RES Algorithm

Algorithm 1 PREV RES algorithm


for all match in footballResults.txt do
if match.Result != Tie then
teamATally = 0
teamBTally = 0
get resultTallies for match.TeamA vs match.TeamB from result matrix
years = years to go back
yearToken = match.Year - years
while yearToken < match.Year do
teamATally += number of teamA wins in yearToken from resultTallies
teamBTally += number of teamB wins in yearToken from resultTallies
yearToken++
end while
if teamATally > teamBTally then
predict teamA to win
else
if teamATally < teamBTally then
predict teamB to win
else
predict teamA [Home team]
end if
end if
end if
end for
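
For readers who prefer concrete code, the sketch below is one possible Java rendering of the tally-and-compare logic in Algorithm 1. The result matrix, its contents and the helper types are simplified assumptions rather than the project's actual structures from Section 4.3.2, and the loop over footballResults.txt is omitted; only the core prediction step is shown.

import java.util.Map;

public class PrevRes {
    /**
     * Predicts the winner of a single (non-tied) fixture by tallying each
     * team's wins against the other over the previous 'yearsBack' seasons,
     * defaulting to the home team (team A) when the tallies are level,
     * as in Algorithm 1.
     *
     * resultTallies maps a season to {team A wins, team B wins} for this
     * fixture; it is a simplified stand-in for the HashMap-based results
     * matrix described in Section 4.3.2.
     */
    static String predict(Map<Integer, int[]> resultTallies,
                          String teamA, String teamB,
                          int matchYear, int yearsBack) {
        int teamATally = 0;
        int teamBTally = 0;
        for (int year = matchYear - yearsBack; year < matchYear; year++) {
            int[] tallies = resultTallies.getOrDefault(year, new int[] {0, 0});
            teamATally += tallies[0];
            teamBTally += tallies[1];
        }
        if (teamATally > teamBTally) {
            return teamA;
        }
        if (teamATally < teamBTally) {
            return teamB;
        }
        return teamA; // level tallies: predict the home team
    }
}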

Appendix D

Feature Ablation Results

The following two tables display the results of the feature analysis carried out on the project's Prototype 5. The first (Table D.1) displays the feature ablation studies within the model. The second (Table D.2) shows how useful each feature was within the final model (Prototype 5). A key for the features can be seen in Table D.3.

Table D.1: Accuracies of Prototype 5 During Feature Ablation Studies

Removed Feature Accuracy (%)


homerecentawayscdif1 60.635
homerecenthomescdif5 61.036
distance 61.118
awayrecentawayscdif4 61.133
awayrecenthomescdif4 61.219
awayrecenthomescdif3 61.231
awayrecentawayscdif2 61.263
awayrecenthomescdif2 61.298
homerecenthomescdif1 61.371*
awayrecenthomescdif5 61.388
homerecenthomescdif4 61.457
homerecenthomescdif2 61.496
homerecentawayscdif3 61.500
homecapacity 61.561
awayrecenthomescdif8 61.568
awaywinratio 1 61.581
awayrecenthomescdif1 61.620
homerecentawayscdif2 61.650
homewinratio 1 61.685
homerecenthomescdif9 61.698
awaycapacity 61.702
awaywinratio 2 61.704
awayrecentawayscdif3 61.723
awayrecenthomescdif9 61.740
awayrecenthomescdif7 61.742
homerecenthomescdif8 61.749*
homerecenthomescdif6 61.790
lastyearresult 61.873*
homewinratio 2 61.925
homerecenthomescdif7 62.019
homerecentawayscdif4 62.058
awayrecentawayscdif1 62.091
awayrecenthomescdif6 62.185
homerecenthomescdif3 62.233

* Statistically significant

Table D.2: Weight Coefficients Within The Logistic Model Used in Prototype 5

Feature Weight Coefficient


homewinratio 1 0.0169*
homerecentawayscdif3 0.01062*
awayrecenthomescdif1 0.01023*
awaywinratio 1 0.00974*
awayrecenthomescdif3 0.00943*
homerecentawayscdif1 0.00941*
homerecenthomescdif3 0.00889*
homerecentawayscdif2 0.00887*
homerecenthomescdif2 0.00831*
awayrecenthomescdif5 0.00758*
homewinratio 2 0.00752*
homerecenthomescdif4 0.00745*
homerecenthomescdif1 0.00641*
awayrecenthomescdif9 0.00473*
awayrecenthomescdif2 0.00461*
awayrecenthomescdif4 0.0046*
homerecenthomescdif7 0.00437*
homerecenthomescdif8 0.00436*
awaywinratio 2 0.0043*
homerecenthomescdif5 0.0041*
awayrecentawayscdif2 0.00408*
awayrecentawayscdif3 0.00398*
homerecentawayscdif4 0.00387*
awayrecentawayscdif4 0.00385*
awayrecentawayscdif1 0.00381*
awayrecenthomescdif6 0.00377*
homerecenthomescdif6 0.0035*
homerecenthomescdif9 0.00316*
awayrecenthomescdif7 0.00282*
lastyearresult 0.00243*
awaycapacity 0
awayrecenthomescdif8 0
homecapacity 0
distance 0

* Statistically significant

Table D.3: Prototype 5 Feature Key

Feature Description
awaycapacity The stadium capacity of the away team
awayrecentawayscdifn The score difference in the nth recent away game for the away team
awayrecenthomescdifn The score difference in the nth recent home game for the away team
awaywinratio 1 The away team’s win ratio for last year
awaywinratio 2 The away team’s win ratio for 2 years previous
distance The distance the away team had to travel to play the match
homecapacity The stadium capacity of the home team
homerecentawayscdifn The score difference in the nth recent away game for the home team
homerecenthomescdifn The score difference in the nth recent home game for the home team
homewinratio 1 The home team’s win ratio for last year
homewinratio 2 The home team’s win ratio for 2 years previous
lastyearresult The result of last year’s corresponding game between the two teams

