

A Data Science Approach to Football Team Player Selection
Dr P Rajesh1, Bharadwaj2, Dr Mansoor Alam3, Dr Mansour Tahernezhadi4
Associate Professor1, Researcher2, KLEF University; Professor3-4, Northern Illinois University
rajesh.pleti@kluniversity.in1, donepudi.babitha@gmail.com2, 3, 4

Abstract- FIFA (Fédération Internationale de Football Association) is the
governing body of world football (soccer) and is separate from the
Olympics. FIFA has been largely instrumental in making soccer the most
popular game in the world, which has led to the development of many
private soccer clubs all over the world. Creating new clubs with young
players, loaning players from other clubs, picking choice positions, and
determining wages and remuneration based on performance and international
rankings form a complicated decision process from a global business
perspective. This paper presents a data science approach that minimizes
the time taken to select a player for a team, with cost and the player's
skills as constraints. Such an analysis helps an owner to maximize the
profit and popularity of an existing club or to create a new club. We
present a statistical analysis of player performance, based on abilities
and skills, for a new team using Power BI and Python Pandas while
minimizing cost. The results show that the approach leads to improved
business profits through a systematic enhancement of football data sets.
These approaches and analytical results can be useful to a franchisor of
proprietary knowledge in forming a group of selected players as a team.

Keywords: Predictions, visualizations, statistical analysis, Sports
Analytics, Clustering, Searching, Multi-dimensional, Power BI

I. INTRODUCTION

Sports analytics, multidimensional data analysis, big data, and
predictive business analytics have received significant attention from
almost all industries as a way to provide better services to their
stakeholders. At the same time, sports companies, websites, broadcasters,
and online platforms make extensive use of statistical and predictive
analytics to identify player insights and scoring patterns, and
comparative selection of professional players by desired goal
characteristics is on the rise everywhere. These approaches run
continuously in real-time dynamic applications, are complex to analyse
and predict, and have attracted many researchers to player performance
evaluation, prediction of optimal solutions, and timely decision making.

Luca Pappalardo, Paolo Cintia, et al. provided a data-driven, role-aware,
multi-dimensional evaluation of player ranking based on player
versatility and performance, with more customizable detection of players
in terms of role, versatility, weighting from match to match, and number
of teams and countries, together with machine learning and artificial
intelligence techniques for the selection of players [1,2,3].

Luke Bornn and Javier Fernandez described spatio-temporal analysis of
soccer data to mine meaningful insights from the quality, position,
frequency, and success of space occupation and goal generation. Their
work also models cooperative, dynamic team decisions during off-ball
activity through pitch control, and is bound to game-specific factors,
current player actions, and the use of space for goal generation [4].

Gennady Andrienko, Natalia Andrienko, et al. incorporated perception into
the analysis of player behaviour. Their work concentrates on mining
pressure relations from defender tactics, insight patterns, and various
disjoint time intervals. It also models and implements a novel
interactive visualization tool, "time mask", and summarizes static and
dynamic visualizations of the pressure exerted among team members. It is
also bound to ball possession, possession change, and position.

Manuel Stein, Halldór Janetzko, et al. provide context analysis, pattern
detection, data perspective analysis, and visualization analysis that
take player behaviour and data set constraints into consideration. Their
work is also bound to strategies, tactics, collaboration, competition,
group movement, and their effects, in order to obtain insights and serve
many goals for better performance. Multi-dimensional analysis of
heterogeneous player and team data is decisive for data scientists who
extract hidden patterns in complex environments [6,7].

Mehrsan Javan, Philippe Desaulniers, et al. illustrate a player
performance evaluation system based on locations, actions, and a
player's chances of scoring the next goal. It uses machine learning
techniques and a Markov model to identify game circumstances, predict
the next goal, and group similar locations of actions with high and low
values. It also gives statistical modelling evaluations of defensive
players and their corresponding context actions to predict a goal [8,9].

978-1-7281-5317-9/20/$31.00 ©2020 IEEE

Authorized licensed use limited to: Auckland University of Technology. Downloaded on October 26,2020 at 05:59:44 UTC from IEEE Xplore. Restrictions apply.

Paolo Cintia, Salvatore Rinzivillo, et al. apply statistical analysis to
inspect players' performance on the pitch and extract game network
features to improve player and team strategies, using ProZone and Opta
football data. They also argue that standard evaluation and prediction
of player and team performance based on previously associated factors,
such as defeats, victories, recorded actions, and goals in qualification
in past games, will no longer be valid [10,11].

Qing Wang, Hengshu Zhu, Wei Hu, et al. proposed an unsupervised approach
to evaluate typical tactics using the Team Tactic Topic Model (T3M), a
learning technique based on the positions and passing relations of
football players. It also concentrates on collaborative, changing team
actions that lead to a goal, and is bound to features such as pass
segments, spatial analysis, tactical pattern discovery, and player roles
[12,13].

Most approaches and methods concentrate on dynamic analytics and present
summarized scoreboards during game time based on situations, context,
opponent analysis, goal generation, post-game analytics, and the
performance issues of different players [15,16]. In the proposed approach
we focus mainly on prior analysis for player selection based on
performance and skill set, on forming a team, and on minimizing selection
time while reducing the initial cost, in order to avoid the downstream
risk factors of an unproductive team environment.

Fig. 1.1: Process of Knowledge discovery

Prediction analysis requires accurate queries, interpreted by logical
representation over multi-dimensional data, to find relevant attributes
such as age, count, and country [17]. The age analysis requires two
parameters: the age, and the number of football players of that age.
Data mining obtains the ages and the count of players at each age using
the groupby method and visualizes the result as a bar chart [18].

Fig. 1.2: Visualization of player's count with respect to age.

Figure 1.2 helps to extract knowledge about the players with respect to
age. Based on it, a club owner can easily pick players for the team by
observing that players aged 21 to 26 are the most active, and that after
26 most of them retire or are unfit to play. A player's support to the
club is determined by the owner according to the contract; the owner can
decide to shorten or extend the contract period according to the
player's performance. For instance, a 2-year contract is adequate for a
24-year-old player, and a 5-year contract is profitable for a
21-year-old player [19].

Visualizations help in understanding the behaviour of data. Power BI, a
business intelligence tool developed by Microsoft, helps to visualize
the data and produce effective visualizations. A custom chart can be
selected from the marketplace, or users can write their own Python or R
plot scripts.

Statistical analysis is performed by taking the following qualitative
and quantitative measures of the attributes into account:
i. Overall distribution of the value of each player according to
overall rating, performance, and age
ii. Distribution of the number of players from each nationality and
their social impact, for investors selecting a player as a brand
ambassador
iii. Comparison of the overall performance and potentiality of players
by nationality
The analytical comparison is based on one qualitative property
(nationality) and multiple quantitative properties (performance,
potentiality, age, Value, Wage, Position, RCB, RB, Crossing, Finishing,
HeadingAccuracy, Short Passing) for the same data point. A nationality
with high potential and a higher overall rating is well suited to
international marketing, and the


nationality with moderate potential and overall performance will stand
out only in its own locality.

In data science, the analysis criteria are significant for knowledge
extraction. A right query demands a relation between the parameters and
the evaluation metrics. For example, the query "players' preference for
each field position" demands a count of the number of players for each
position, but that count alone does not give an accurate prediction.
From it we can, however, understand the importance of each position.
"Striker (ST)" is the dominant position, followed by "Centre Back (CB)";
one is an attacking position and the other is a defensive position at
the back. In football, most of the play happens between the goals. If
the opposing striker manages to get the ball past the defence, it
results in a goal in most cases.

Fig. 2.1: Visualization of players' preference for each field position.

The striker scores the points, and the defence (CB, RWB, LWB) stops the
opponents from scoring.

TABLE I: Position of Players in the field
Abbreviation  Full Form                    Abbreviation  Full Form
CAM           Centre Attacking Midfielder  RDM           Right Defence Midfielder
CB            Centre Back                  RM            Right Midfielder
CDM           Centre Defence               RW            Right Winger
CF            Centre Forward               RAM           Right Attacking Mid Fielder
CM            Centre                       RB            Right Back
LAM           Left Attacking               RCB           Right Centre Back
LB            Left Back                    RCM           Right
LCB           Left Centre Back             RF            Right Fielder
LCM           Left Centre Midfielder       RS            Right Striker
LDM           Left Defence Midfielder      RWB           Right Winger Back
LF            Left                         ST            Striker
LM            Left Midfielder              LW            Left Winger
LS            Left Striker                 LWB           Left Winger

Fig. 2.2: Positioning in field

As shown in Figure 2.2, the positions that players occupy are crucial
when considering the winning criteria and their reputation. For the
winning criteria, the overall rating of players by age and position was
taken into account. Table I shows the list of positions.

Fig. 2.3: Number of players per overall rating
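The counts plotted in Figures 1.2, 2.1, and 2.3 are all frequency counts over a single column (age, position, or overall rating). A minimal sketch of that grouping step is given below, assuming a handful of hypothetical player rows; it uses only the Python standard library, and with pandas the same count is commonly written as df.groupby("Age").size(). The helper names count_by and text_bar_chart are illustrative and not taken from the paper.

```python
from collections import Counter

# Hypothetical sample rows; the data set described in the paper has
# around 18 thousand records with many more attributes.
players = [
    {"Name": "P1", "Age": 21, "Position": "ST", "Overall": 66},
    {"Name": "P2", "Age": 21, "Position": "CB", "Overall": 64},
    {"Name": "P3", "Age": 24, "Position": "ST", "Overall": 69},
    {"Name": "P4", "Age": 26, "Position": "CM", "Overall": 71},
    {"Name": "P5", "Age": 24, "Position": "CB", "Overall": 65},
    {"Name": "P6", "Age": 30, "Position": "ST", "Overall": 80},
]


def count_by(rows, column):
    """Count rows per distinct value of `column`, sorted by that value."""
    counts = Counter(row[column] for row in rows)
    return dict(sorted(counts.items()))


def text_bar_chart(counts, width=20):
    """Render the counts as a plain-text horizontal bar chart."""
    peak = max(counts.values())
    lines = []
    for value, n in counts.items():
        bar = "#" * max(1, round(width * n / peak))
        lines.append(f"{value!s:>4} | {bar} {n}")
    return "\n".join(lines)


age_counts = count_by(players, "Age")            # grouping behind Fig. 1.2
position_counts = count_by(players, "Position")  # grouping behind Fig. 2.1
print(text_bar_chart(age_counts))
```

The same helper applied to the "Overall" column gives the per-rating counts of the kind shown in Fig. 2.3.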


From Figure 2.3, it is observed that most players have an overall
performance in the range of 64 to 69, and almost no players are at the
extremes. For the winning criteria, however, considering age together
with overall performance is crucial, because the contract period leads
to profit or loss for a club owner. Thus, in terms of value, a
moderately performing young player is much more beneficial than an
older, better performing player.

Fig. 2.4: Value of each player according to overall Rating and age

From Figure 2.4, players with an overall performance above 80 and a low
value are spread across all ages. For a profitable investment, picking
players from the 21 to 26 age group identified in Figure 1.2 is sensible
from a business perspective. Likewise, the reputation of a club also
depends on its players' fan base, which is tied to the nationality of
the player. Players from different nations pave the way for increased
club recognition and international marketing, which eventually increases
the number of investors around the globe and raises share values.

Fig. 2.5: Number of players from each nationality

Figure 2.5 represents the players' nationalities: the larger the surface
area of the indicator, the higher the count of players from the
respective nation, which itself indicates the follower base. It shows
that the greatest number of players are from England and its
neighbouring nations, followed by South America and Africa. Africa may
not appear here with a large surface area, but combining all the African
nations within the data definitely gives a higher number. Plain
statistics on the count per nation are not very helpful; extending them
to a comparison of the overall and potential ratings of nations across
the attributes (performance, potentiality, age, Value, Wage, Position,
RCB, RB, Crossing, Finishing, HeadingAccuracy, Short Passing) is more
supportive.

Fig. 2.6: Comparing overall and potentiality of players by nationality

Figure 2.6 provides a clear picture for the selection of players under
the reputation criteria. A stacked bar plot is very helpful for
representing a qualitative property together with multiple quantitative
properties for the same data point. The diagram shows that England's
overall rating and potential are better, and that France also has equal
overall and potential levels. When selecting players from different
nations, capability plays a crucial part. The capability of players
indicates the interest of the people of that nation in the sport, which
is very helpful for market expansion.

The objectives of this paper are as follows:
1. To identify and compare players' individual performance by their age
and nationality
2. To identify and compare players' value and their overall performance
3. To correlate the skills of players to predict a player's position
4. To apply clustering/classification techniques to the positions of
players according to their age and overall performance


5. To compare the qualitative and quantitative properties with respect
to reputation

From the above objectives we can identify the goal of creating a dream
team while reducing time and cost. With the first two objectives, we can
identify whether a player is compatible with team norms such as age
restrictions and the value-to-performance ratio. If the player is
compatible, we proceed to the next objective and analyse whether his
skills suit a particular position. For an all-rounder, correlation may
not give an exact representation because all skills have equal
importance. In that scenario, we conduct a cluster analysis over all
positions that takes performance values and age into consideration,
which gives the player's skill in each position cluster. The cluster
analysis returns the highest performance level to be considered for
position analysis. Using this technique, we can achieve the following
goals:
1. Minimize the time taken to choose a player
2. Help businesses to pick brand ambassadors
3. Build a dream team

The football data set consists of around 18 thousand records with the
attributes 'Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club',
'Value', 'Wage', and 'Position', together with a rating at each position
and a rating for each skill, which support new predictive analysis. The
data set contains a few null values and a few numbers stored as strings
instead of integers; these are transformed in a pre-processing step.
Most of the null values are in the column "Loaned From", and the columns
"Value" and "Wages" are of string type, so they were transformed for
analysis. The data is then ready for analysis to find interesting
patterns. The data set is divided into a training set and a test set in
an 80:20 ratio to implement the classification algorithm.

In this paper, we explore ways in which data analysis and machine
learning can improve decision making and business growth through better
identification of players for business partnerships on football data
sets.

Pseudo Algorithm

Input:
1. Initiate population (P) of players data with details about their
   performance
2. A Clustering Model that can be performed on larger

Output:
List of Positions tuple with players in it
[("Position1", "Player4"), ("Position2", "Player7"), ….]

Overall, Potential, Value, Wage, Position, Age, RCB, RB, Crossing,
Finishing, HeadingAccuracy, Short Passing, Volleys, Dribbling, Curve,
FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed,
Agility, Reactions, Balance, Jumping, Stamina, Strength, LongShots,
Aggression, Interceptions, Positioning, Vision, Penalties, Composure,
Marking, StandingTackle, SlidingTackle.

1. Create a sub-population p with Overall, Age, Position attributes
   from P
2. Group samples in p based on clustering technique where each cluster
   is a Position derived from Age and Overall Performance
3. For each Position in Positions:
4.     Obtain a correlation matrix Positionc
5.     For each player in Position:
6.         Obtain correlation matrix Playerc
7.         If coeff(Positionc) - coeff(Playerc) < λ then
8.             Add Playerc to Positionc
9.         Else
10.            Pick next player for evaluation
11.    If Positionc contains > 1 player then
12.        Positionc -> check(wage(Playerc), threshold)
13.    Add Positionc to Player_Position_Tuples
14. return Player_Position_Tuples

Table 3.1: Classifier Models and Evaluation results
S.No  Algorithm                            Accuracy Score  F1 Score  Jaccard Similarity
1     Naive Bayes                          0.460748        0.516143  0.610273
2     Random Forest                        0.832546        0.920906  0.856523
3     Decision Tree                        0.766956        0.811075  0.735357
4     SVC                                  0.753468        0.809675  0.784768
5     Proposed prediction of Team players  0.783258        0.795274  0.764862
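The pseudo algorithm above leaves several details open, so the following is only one possible reading, sketched in plain Python under stated assumptions rather than the authors' implementation. It assumes that the position labels are taken directly from the data (the clustering step is skipped), that Positionc is summarized by the mean skill vector of the players in a position, that coeff is a Pearson correlation so that the test in step 7 becomes 1 - r < λ, and that the wage check keeps players at or below a threshold and then prefers the highest overall rating. The skill subset, the sample players, lam, and wage_threshold are all hypothetical.

```python
from math import sqrt

# Hypothetical subset of the skill attributes listed in the paper.
SKILLS = ["Finishing", "Crossing", "Marking", "StandingTackle"]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_team(players, lam=0.1, wage_threshold=50_000):
    """Return [(position, player name)] under the assumptions stated above."""
    by_position = {}
    for p in players:
        by_position.setdefault(p["Position"], []).append(p)

    result = []
    for position, group in sorted(by_position.items()):
        # Assumed position profile: mean of each skill over the group.
        profile = [sum(p[s] for p in group) / len(group) for s in SKILLS]
        # Step 7: keep players whose skill pattern is within lam of the profile.
        candidates = [
            p for p in group
            if 1.0 - pearson([p[s] for s in SKILLS], profile) < lam
        ]
        # Steps 11-12: with several candidates, apply the wage check.
        if len(candidates) > 1:
            affordable = [p for p in candidates if p["Wage"] <= wage_threshold]
            candidates = affordable or candidates
        if candidates:
            best = max(candidates, key=lambda p: p["Overall"])
            result.append((position, best["Name"]))
    return result


def make(name, pos, overall, wage, skills):
    """Build one hypothetical player record."""
    return {"Name": name, "Position": pos, "Overall": overall,
            "Wage": wage, **dict(zip(SKILLS, skills))}


players = [
    make("Player1", "ST", 82, 90_000, [85, 60, 30, 25]),
    make("Player2", "ST", 78, 40_000, [80, 62, 32, 28]),
    make("Player3", "ST", 70, 20_000, [40, 50, 70, 75]),  # defender-like skills
    make("Player4", "CB", 80, 45_000, [30, 45, 82, 85]),
    make("Player5", "CB", 76, 30_000, [28, 40, 78, 80]),
]
team = select_team(players)
print(team)
```

In this toy run, Player3 is rejected for ST because his skill pattern runs opposite to the position profile, and Player1 is dropped by the wage check, which illustrates how the correlation test and the cost constraint interact.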


Fig 3.1: ROC curve showing the accuracy of
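The three scores reported in Table 3.1 can be computed from true and predicted labels after an 80:20 split. The sketch below assumes binary labels, since the paper does not state how the scores were averaged, and uses hypothetical label vectors; scikit-learn offers accuracy_score, f1_score, and jaccard_score in sklearn.metrics for the same quantities.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of `rows` and split it, 80:20 by default."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def binary_scores(y_true, y_pred, positive=1):
    """Accuracy, F1 score, and Jaccard similarity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    errors_and_hits = tp + fp + fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "f1": 2 * tp / (2 * tp + fp + fn) if errors_and_hits else 0.0,
        "jaccard": tp / errors_and_hits if errors_and_hits else 0.0,
    }


# Hypothetical labels: 1 = player suits the position, 0 = does not.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
scores = binary_scores(y_true, y_pred)
train, test = train_test_split(range(10))
print(scores, len(train), len(test))
```

Because F1 and Jaccard both ignore true negatives, they can rank models differently from accuracy, which is consistent with the differing orderings visible across the columns of Table 3.1.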


In this data set, the scope of analysis is enormous because the skills
of players are correlated with each other, which helps in picking a team
to specific requirements from both a winning and a business perspective.
This can be quite challenging, especially when a player shows a high
skill set for multiple positions. Business people usually lean towards
the overall skill set instead of the skill set for a particular
position, which does not suit every situation; for this kind of analysis
the "correlation matrix" approach is beneficial. A correlation matrix is
a symmetric matrix of correlation coefficients between attributes or
variables. In this analysis the variables are the skills of players. The
correlation between all the skills and overall performance helps to
predict how the performance of a player with a particular skill set
influences the position.

Fig. 3.2: Correlation between overall and performance at each skill

Every variable (key influencer) in the correlation matrix is perfectly
correlated with itself. In Figure 3.2, the pairs {Sprint Speed,
Acceleration}, {Sliding Tackle, Marking}, {Standing Tackle, Marking},
and {Standing Tackle, Sliding Tackle} are highly correlated, which means
that a player with a high standing tackle rating should also be
impressive at marking and sliding tackles. The difficult decisions start
when assigning positions to players; the correlation matrix of a
particular position helps us make those decisions faster.
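A correlation matrix of the kind discussed here can be built from pairwise Pearson coefficients. The following is a minimal sketch on hypothetical ratings for five players and three skills, written with the standard library only; with pandas, DataFrame.corr() produces the same kind of matrix. It also checks the two properties mentioned in the text: symmetry, and a unit diagonal.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return {skill_a: {skill_b: r}} for every pair of columns."""
    return {
        a: {b: pearson(xs, ys) for b, ys in columns.items()}
        for a, xs in columns.items()
    }


# Hypothetical skill ratings, one list entry per player.
skills = {
    "StandingTackle": [80, 70, 30, 25, 60],
    "SlidingTackle":  [78, 68, 28, 22, 62],
    "Finishing":      [30, 40, 85, 88, 50],
}
matrix = correlation_matrix(skills)

for a, row in matrix.items():
    print(a.ljust(15), " ".join(f"{r:6.2f}" for r in row.values()))
```

In this toy data the two tackling skills move together while finishing moves against them, mirroring the defender-versus-striker pattern described for Fig. 3.2.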
Let us consider a scenario in which player X, with an overall rating of
63, prefers position A, but position A is already assigned to another
player. We then assign another position to him, either by checking all
the positions that match his skills or simply by using the correlation
matrix above to match skills for position A.


Key influencers vary for every position. As shown in Fig. 3.4, finishing
is the key influencer for the striker's position. Considering all the
key influencers above a tolerance value in the correlation matrix gives
accurate and precise results.
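Selecting key influencers by a tolerance value can be sketched as a filter on each skill's correlation with the overall rating within one position. The ratings below are hypothetical values for five strikers, and the tolerance of 0.8 is an assumed value, since the paper does not state the one it used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical ratings of five strikers, one list entry per player.
strikers = {
    "Overall":     [85, 80, 75, 70, 65],
    "Finishing":   [88, 82, 76, 71, 64],
    "SprintSpeed": [90, 78, 80, 74, 70],
    "Marking":     [30, 35, 28, 33, 31],
}


def key_influencers(table, target="Overall", tolerance=0.8):
    """Skills whose correlation with `target` is at least `tolerance`,
    strongest first."""
    scores = {
        skill: pearson(values, table[target])
        for skill, values in table.items()
        if skill != target
    }
    kept = [(skill, r) for skill, r in scores.items() if r >= tolerance]
    return sorted(kept, key=lambda item: item[1], reverse=True)


influencers = key_influencers(strikers)
for skill, r in influencers:
    print(f"{skill}: {r:.3f}")
```

Raising the tolerance shortens the list to the strongest influencers, while lowering it admits weaker ones, so the value acts as the knob between precision and coverage.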

Fig. 3.3: Correlation between skills required for striker's position

The drawback of using correlation alone is that it is a two-parameter
analysis for picking players according to position and rating. A player
of a specific age may not fit a specific position even though his skills
are sufficient and he ranks well by overall rating. For example, suppose
we pick some players with overall rating as the constraint, and all of
them turn out to be senior players; this is unsurprising, as senior
players have comparatively more experience. For a good team, however, a
mix of ages is better. So we need to group players for positions with
both overall rating and age as constraints.

Fig. 3.5: Multiple Groups of players with respect to age and overall
performance

Clustering takes each data object and groups it by behavioural
similarity, i.e., players in a cluster are most likely to prefer the
positions of the same cluster and not any other position.
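Grouping players by age and overall rating can be sketched with K-means, using k = 3 as in the clustering described for Fig. 3.6. The version below is a plain-Python illustration on nine hypothetical (age, overall) points; the deterministic choice of initial centroids (evenly spaced points of the sorted data) is an assumption made so that the run is repeatable. In practice a library implementation such as sklearn.cluster.KMeans would normally be used.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, k=3, iterations=100):
    """Basic K-means (k >= 2): assign to nearest centroid, then relocate."""
    ordered = sorted(points)
    # Assumed initialisation: k evenly spaced points of the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Relocate each centroid to the mean of its cluster.
        relocated = [
            mean_point(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if relocated == centroids:
            break
        centroids = relocated
    labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
              for p in points]
    return labels, centroids


# Hypothetical (Age, Overall) pairs.
data = [
    (20, 62), (21, 64), (22, 63),   # young, moderate rating
    (25, 80), (26, 82), (27, 81),   # prime age, high rating
    (33, 70), (34, 72), (35, 71),   # senior, medium rating
]
labels, centroids = kmeans(data, k=3)
print(labels)
print(centroids)
```

Age and overall rating happen to be on similar scales here; with attributes on very different scales, such as wage, the features would usually be standardized before clustering.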

Fig. 3.4: Key Influencers for Striker's position

Fig. 3.6: Clustering of player's position based on overall performance
and age

Along with the machine learning classification algorithms, the K-means
clustering algorithm is implemented to cluster players' positions based
on their overall performance and age. K-means is a partitioning-based
clustering approach. k is taken as 3 for simplicity, and clustering is
performed using centroids and an iterative relocation technique to
achieve intra-cluster similarity and inter-cluster dissimilarity. The
winning criteria are supported by the parameters age, performance, and
position. Nationality and overall potential represent the reputation of
the club. From a business perspective, however, a club owner is
concerned not only with the arrangement and assessment of players but
also, as a major concern, with paying them.

Fig. 3.7: Average wage of various clubs

For a new player, the wage depends on value; it can be from 1% to 10%
of the value. A player taken from another club expects more, and may
otherwise refuse to join, which may lead to business disappointment.
Figure 3.7 shows the clubs with their average wage amounts. Real Madrid
has the highest wage value.

Fig. 3.8: Market Value of all the players with different body types and
preferred foot

These analytical results are also helpful for NGOs that motivate people
with physical disabilities. For example, an NGO that wants to motivate
children with a lean build, or single-dextrous people with low
self-esteem from different cultures, can prepare a report showing that
football players with those traits can also have a high market value.

The analytical power of data science makes businesses more profitable.
Business decisions on complex data sets are often tough to make in a
timely manner. This paper has presented a data science approach and a
novel pseudo-code algorithm for selecting players from different nations
for corresponding positions, supporting timely international business
expansion decisions and incorporating analytical results in terms of
features such as a player's skills, performance, position, rating, wage,
heading accuracy, short passing, volleys, dribbling, curve, FK accuracy,
long passing, ball control, acceleration, and sprint speed. These
machine learning analyses support businesses in picking a player as a
brand ambassador and owners in forming a new club. Cost reduction means
selecting, for each position, a player with a nominal wage and the
expected skill set and performance.

This approach reduces the initial risk factors of player selection
substantially (i.e., by up to 50%) by giving prominence to player
features based on market value, popularity, the player's quality, and
his performance by type of contest for national teams. We also aimed to
minimize the time, the cost, and the inconsistency between approximate
and real market values in the selection of players for a team.
Approaches of this kind can principally be used for managing and
commercializing the financial profit of sports analytics.

As an additional direction, innovative AI-based solutions could be
incorporated with comparative analytical results for future enhancements
to decision making on different data sets, considering other aspects
such as player injuries, GPS data, spatio-temporal data, video-based
extraction of player performance, and trajectory data.

We thank FIFA and Kaggle for providing the football data set for
analytical research. We are thankful to the professors of the K L
University Analytics Group for constructive discussions, and to the
reviewers for valuable suggestions that improved the manuscript design
and the statistical analysis of the work to meet the respective
research objectives.


REFERENCES
[1] Luca Pappalardo, Paolo Cintia, Paolo Ferragina, PlayeRank: data-driven
performance evaluation and player ranking in soccer via a machine
learning approach, arXiv:1802.04987v3, 25 Jan 2019.
[4] Luke Bornn and Javier Fernandez. 2018. Wide Open Spaces: A statistical
technique for measuring space creation in professional soccer. In MIT
Sloan Sports Analytics Conference 2018.
[5] Gennady Andrienko, Natalia Andrienko, Guido Budziak, Jason Dykes,
Georg Fuchs, Tatiana von Landesberger, and Hendrik Weber. 2017.
Visual analysis of pressure in football. Data Mining and Knowledge
Discovery 31, 6 (01 Nov 2017), 1793–1839.
[6] Manuel Stein, Halldór Janetzko, Daniel Seebacher, Alexander Jäger,
Manuel Nagel, Jürgen Hölsch, Sven Kosub, Tobias Schreck, Daniel A.
Keim, and Michael Grossniklaus. 2017. How to Make Sense of Team
Sport Data: From Acquisition to Data Modeling and Research Aspects.
Data 2, 1 (2017).
[7] Dr. P. Rajesh, K. Vamsikrishna Reddy, "Stock trend prediction using
Ensemble learning techniques in python", International Journal of
Innovative Technology and Exploring Engineering (IJITEE), 2019.
[8] V.I.S. RamyaSri, Ch. Niharika, K. Maneesh, Mohammed Ismail,
"Sentiment Analysis of Patients' Opinions in Healthcare using
Lexicon-based Method", International Journal of Engineering and
Advanced Technology, Volume-9, Issue-1, October 2019.
[9] Oliver Shulte and Zeyu Zhao. 2017. Apples-to-Apples: Clustering and
Ranking NHL Players Using Location Information and Scoring Impact.
In MIT Sloan Sports Analytics Conference. Hynes Convention Center,
Boston, MA, USA.
[10] P. Rajesh, Dr. G. Narsimha, "Fuzzy based privacy preserving
classification of data streams", in ACM conference (CUBE), Pune,
pp. 784-788, 2012, DBLP, ISBN: 978-1-4503-1185-4.
[11] Paolo Cintia, Salvatore Rinzivillo, and Luca Pappalardo. 2015.
Network-based Measures for Predicting the Outcomes of Football Games.
In Proceedings of the 2nd Workshop on Machine Learning and Data Mining
for Sports Analytics co-located with 2015 European Conference on
Machine Learning and Principles and Practice of Knowledge Discovery
in Databases (ECML PKDD 2015).
[12] Qing Wang, Hengshu Zhu, Wei Hu, Zhiyong Shen, and Yuan Yao. 2015.
Discerning Tactical Patterns for Professional Soccer Teams: An
Enhanced Topic Model with Applications. In Procs of the 21th ACM
SIGKDD Intl Conf on Knowledge Discovery and Data mining.
[13] K. Srinivas, Mohammed Ismail B., "Testcase Prioritization With
Special Emphasis On Automation Testing Using Hybrid Framework",
Journal of Theoretical and Applied Information Technology 96(13),
4180-4190, July
[14] Rodolfo Metulini, Tullio Facchinetti, Paola Zuccolotto, Detecting and
classifying moments in basketball matches using sensor tracked
[15] Zuccolotto, P., Manisera, M., & Sandri, M. (2018). Big data analytics
for modeling scoring probability in basketball: The effect of shooting
under high-pressure conditions. International Journal of Sports
Science & Coaching, vol. 13(4), pp. 569-589.
[16] Naman Gupta, "Off the Ball: A Data Science Approach to Real-Time
Football Fan Engagement", Thesis report of University of Michigan,
[17] Tom Decroos, Jan Van Haaren, and Jesse Davis. 2018. Automatic
Discovery of Tactics in Spatio-Temporal Soccer Match Data. In
Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (KDD '18). ACM, New York,
NY, USA, 223–232.
[18] Paolo Cintia, Fosca Giannotti, Luca Pappalardo, Dino Pedreschi, and
Marco Malvaldi. 2015. The harsh rule of the goals: data-driven
performance indicators for football teams. In Procs of the 2015 IEEE
International conference on Data Science and Advanced Analytics.
[19] P. Fillipids, "Creating and Using Sports Linked Data: Applications
and Analytics", CEUR Workshop Proceedings, 2015.

