NHL Players Salary Project Documentation

A Mini project Report
On
NATIONAL HOCKEY LEAGUE PLAYER SALARY

USING
MACHINE LEARNING
Submitted to JNTUH HYDERABAD
In Partial Fulfilment of the requirement for the Award of Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted
By
Vasishtanada Reddy. A (188R1A0561)

Sai Kiran. A (188R1A0562)
Hima Prasad. K (168R1A05L2)
Venkata Chaitanya Kumar. T (168R1A05B7)
Under the Esteemed guidance of

Mrs. P. Usha Rani
Associate Professor, Department of CSE
Department of Computer Science & Engineering
CMR ENGINEERING COLLEGE

(Approved by AICTE, NEW DELHI, Affiliated to JNTU, Hyderabad)
Kandlakoya, Medchal Road, R.R. Dist. Hyderabad-501 401)
(2020-2021)
CMR ENGINEERING COLLEGE
(Approved by NBA, Approved by AICTE NEW DELHI, Affiliated to JNTU, Hyderabad)
Kandlakoya, Medchal Road, R.R. Dist. Hyderabad-501401)
Department of Computer Science & Engineering
CERTIFICATE
This is to certify that the project entitled “NATIONAL HOCKEY LEAGUE
PLAYER SALARY USING MACHINE LEARNING” is a bonafide work carried
out by
Vasishtanada Reddy. A (188R1A0561)
Sai Kiran. A (188R1A0562)
Hima Prasad. K (168R1A05L2)
Venkata Chaitanya Kumar. T (168R1A05B7)
in partial fulfilment of the requirements for the degree of BACHELOR OF
TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING from
CMR Engineering College, affiliated to JNTU, Hyderabad, under our guidance and
supervision.
The results presented in this project have been verified and are found to be
satisfactory. The results embodied in this project have not been submitted to any
other university for the award of any other degree or diploma.
Internal Guide: Mrs. P. Usha Rani, Associate Professor, Department of CSE, CMREC, Hyderabad
Mini Project Coordinator: Mr. S. Kiran Kumar, Assistant Professor, Department of CSE, CMREC, Hyderabad
Head of the Department: Dr. Sheo Kumar, Professor & HOD, Department of CSE, CMREC, Hyderabad
DECLARATION
This is to certify that the work reported in the present project entitled
“NATIONAL HOCKEY LEAGUE PLAYER SALARY USING MACHINE LEARNING”
is a record of bonafide work done by us in the Department of Computer Science and Engineering,
CMR Engineering College, JNTU Hyderabad. The report is based on project work done
entirely by us and not copied from any other source. We submit our project for further
development by any interested students who wish to improve the project in the future.
The results embodied in this project report have not been submitted to any other University or
Institute for the award of any degree or diploma, to the best of our knowledge and belief.
By:
Vasishtananda Reddy . A : (188R1A0561)

Sai Kiran. A : (188R1A0562)
Hima Prasad .T : (16R1A05L2)
Venkata Chaitanya Kumar : (16R1A05B7)
CONTENTS
Title page
Acknowledgements
Abstract
1. Introduction
2. Data Representation and Visualization
3. Models
3.1. Linear Regression
3.2. Ridge Regression
3.3. Lasso Regression
4. UML Diagrams
4.1. Use Case Diagram
4.2. Class Diagram
4.3. Sequence Diagram
5. Output screenshots
Conclusion
References
ABSTRACT
In the age of big data, the field of sports analytics is becoming increasingly transdisciplinary,
combining domain specific knowledge from sports management with the statistical and
computational tools of data science. Hockey is a sport that generates massive amounts of data with
no shortage of opportunity for analysis. As statistics becomes more integrated into professional
sports, having an analytical edge gives both athletes and teams a competitive advantage. The goal of
the project is to use machine learning to predict the salaries of NHL players during the 2016-2017
season. This end-to-end machine learning project involves data manipulation, exploratory data
analysis, model training, model optimization, model evaluation and predictive modelling.
Like other professional sports leagues such as the NFL and the NBA, the NHL is a salary-cap
market, which means that only a certain amount of money can be spent on players’
salaries. General managers find it challenging to optimize the allocation of salaries among
players. A league-wide salary cap helps reduce the instances of players being paid far
more than their worth. However, it does not fully prevent managers from overestimating players’
worth. By using regression methods, teams can reduce the risk of overspending on player contracts
and thus build a more competitive team.
1. INTRODUCTION
As statistics becomes more integrated into professional sports, having an analytical edge gives both
athletes and teams a competitive advantage. The goal of the project is to use machine learning to
predict the salaries of NHL players during the 2016-2017 season. This end-to-end machine learning
project involves data manipulation, exploratory data analysis, model training, model optimization,
model evaluation and predictive modelling.
In this project, we use supervised learning regression techniques to develop models that can be used
by the general managers to accurately determine the worth of the player. General managers can also
use these statistical learning models to sign players to a lower salary early in their career before they
reach their full potential.
Existing System:
The existing system uses a basic linear regression model. One of the main issues with basic
linear regression is that it has no regularization parameter and is therefore prone to
overfitting the data. The system also does not provide enough preprocessing, visualization or
Exploratory Data Analysis (EDA).
Disadvantages of Existing System:
• The existing systems are not sufficient to deal with complex data. Their main
limitations are listed below.
• The model suffers from overfitting because it does not generalize well beyond the training data.
• The error on test data is high due to overfitting.
• The system also requires extensive data preprocessing and Exploratory Data
Analysis (EDA) in order to perform feature engineering.
Proposed System:
We aim to build regularized regression models such as ridge and lasso regression and to
fine-tune their parameters. These models are trained on a data set that has been carefully
prepared through feature engineering.
Advantages:
• Clearly formulate and explain key hypotheses regarding the factors that drive price
• Understand and set lower and upper bounds for model performance
• Show how and why key steps in cleaning, EDA, feature engineering, model building and
optimization were executed
• Understand the main factors that drive pricing, and present recommendations based on these
Software requirements:
Operating System : Windows 7, Windows 8 (or higher versions)
Language : Python 3.5 and libraries such as numpy, pandas, matplotlib, seaborn and
scikit-learn.
Browser : Mozilla Firefox (or any browser)
Hardware requirements:
Processor : Pentium 3, Pentium 4 and higher
RAM : 2GB/4GB RAM and higher
Hard disk : 40GB and higher
2. DATA REPRESENTATION AND VISUALIZATION
This dataset features the salaries of 874 NHL players for the 2016/2017 season. The players were
randomly split into training (612 players) and test (262 players) populations. There are 154
predictor columns (described in the column legend section; if you're not familiar with hockey, the
meaning of some of these may be a bit cryptic) as well as a leading column with the player's
2016/2017 annual salary. For the test population, the actual salaries have been split off into a
separate .csv file.
We performed feature engineering on the given data set. We replaced null values in the data
set with the median, mode or mean, depending on the column. Based on the
correlation values, we removed features that were highly correlated with other features. We
converted the ‘Date of Birth’ feature into ‘Age’ and performed one-hot encoding to convert the
columns of type ‘object’ into numeric / binary values. We also applied feature selection
to the data set: we removed features with low variance (less than 10%) to improve the
performance of the model and kept the remaining selected columns.
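The preprocessing steps described above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up table, not the project's actual code: the column names (`Born`, `Cntry`, `GP`, `Const`), the helper name `preprocess`, the reference date used to compute age, and the reading of "less than 10%" as a variance threshold of 0.1 are all assumptions.

```python
import pandas as pd


def preprocess(df, reference_date="2016-10-01", min_variance=0.1):
    """Impute gaps, derive Age, one-hot encode text columns, drop flat columns."""
    df = df.copy()

    # Convert date of birth into an approximate age in whole years.
    born = pd.to_datetime(df.pop("Born"))
    df["Age"] = (pd.Timestamp(reference_date) - born).dt.days // 365

    # Fill numeric gaps with the column median, text gaps with the mode.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # One-hot encode the remaining text columns into 0/1 indicator columns.
    df = pd.get_dummies(df, dtype=int)

    # Keep only columns whose variance reaches the (assumed) threshold.
    variances = df.var()
    return df.loc[:, variances >= min_variance]


# Hypothetical sample rows, for illustration only.
raw = pd.DataFrame({
    "Born": ["1990-10-01", "1995-03-15", "1987-08-07", "1992-01-20"],
    "Cntry": ["CAN", "USA", None, "CAN"],
    "GP": [82.0, None, 60.0, 70.0],
    "Const": [1, 1, 1, 1],
})
clean = preprocess(raw)
```

In this toy run the constant column is dropped, the missing country is filled with the most frequent value, and the missing games-played value is filled with the median.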
[Figure: distribution of hockey players by country]
[Figure: distribution of salaries]
[Figure: information about the different teams]
Feature Importance:
This diagram represents the importance of some of the features (i.e., contribution of those features
towards the target value):
Heat map:
Heat map shows the correlation among the features. If correlation between any two features is
considerably high, we can remove one of those two features.
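The correlation-based pruning described above can be sketched as follows. This is only an illustration: the helper name `drop_correlated`, the 0.9 cut-off and the sample columns are assumptions, since the report does not state which threshold was used.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop one column from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair of features is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Hypothetical statistics: PTS is almost a multiple of G, PIM is unrelated.
stats = pd.DataFrame({
    "G": [10, 20, 30, 40, 50],
    "PTS": [21, 39, 62, 80, 101],
    "PIM": [30, 12, 45, 8, 27],
})
reduced = drop_correlated(stats)
```

Here `PTS` is removed because it is almost perfectly correlated with `G`, while `PIM` is kept.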
After removing some features considering the correlation among them, a heat map was obtained from
the remaining features.
[Figure: heat map of the remaining features]
3. MODELS
Regression is a method of modelling a target value based on independent predictors. This method is
mostly used for forecasting and for quantifying the relationship between variables.
Regression techniques mostly differ based on the number of independent variables and the type of
relationship between the independent and dependent variables.
3.1. Linear Regression:
Linear regression is one of the simplest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis. Linear regression makes predictions for
continuous (real-valued or numeric) variables such as sales, salary, age and product price.
The linear regression algorithm models a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name. In other words, it estimates how the value
of the dependent variable changes with the values of the independent variables. With a single
predictor, the model is a sloped straight line representing the relationship between the variables.
When working with linear regression, our main goal is to find the best-fit line, that is, the line
for which the error between predicted values and actual values is minimized. Different values of the
weights (the coefficients of the line) give different regression lines, so we use a cost function to
find the values that give the best fit.
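As an illustration of the cost function mentioned above, the sketch below assumes the usual mean squared error cost, i.e. the average of (prediction - actual)^2 over all samples, and finds the weights that minimize it with NumPy's least-squares solver. The function names and the toy data are made up for this example.

```python
import numpy as np


def mse_cost(X, y, weights, intercept):
    """Mean squared error between predictions X @ weights + intercept and y."""
    residuals = X @ weights + intercept - y
    return float(np.mean(residuals ** 2))


def fit_least_squares(X, y):
    """Return (weights, intercept) that minimize the MSE cost."""
    design = np.column_stack([X, np.ones(len(X))])  # extra column for the intercept
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[:-1], solution[-1]


# Toy data lying exactly on the line y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
weights, intercept = fit_least_squares(X, y)

best_cost = mse_cost(X, y, weights, intercept)    # cost of the fitted line
other_cost = mse_cost(X, y, np.array([2.5]), 1.0)  # cost of a steeper line
```

Any other line, such as the steeper one with slope 2.5, gives a strictly larger cost than the fitted line on this data.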
Results:
The dataset was split randomly into training and test sets, and dimensionality reduction was then
carried out on both. The model performance in terms of the R-squared score,
MAE, MSE and RMSE is as follows.
linear model intercept: 346875962.2610781
linear model coeff:
[ 2.95376314e+03 -1.74635943e+05 -3.23761901e+03 4.98527315e+03
1.73577310e+04 5.30145988e+04 5.90169322e+03 2.28385801e+04
3.72655271e+04 -8.70264039e+02 4.30827012e+06 1.41300876e+05
-1.02430135e+03 2.20728622e+04 -2.50135802e+04 3.35513481e+04
-3.45290411e+04 -1.68612895e+04 -2.21939045e+04 -2.71241116e+03
-8.14557226e+02 -2.00204688e+04 6.69659066e+03 3.86292193e+03
-1.06059509e+04 -8.36147131e+03 -1.38469910e+04 -2.37584516e+04
4.72550499e+02 -5.06639778e+02 2.58212004e+03 -2.10233782e+05
3.24044080e+03 1.45752626e+04 6.59533406e+03 1.14571447e+03
4.67777995e+04 2.85086440e+04 8.04321429e+04 3.24710299e+04
4.42616990e+04 -9.94653023e+04 3.65065707e+04 3.91661311e+04
-5.52586085e+05 -1.58230410e+04 1.38735893e+03 6.07841006e+03
-1.25411392e+03 -7.00339339e+02 5.09717477e+03 -3.24726677e+03
1.59654239e+03 -4.81597531e+05 -1.12309115e+03 1.80045533e+04
8.08844718e+03 -4.83811516e+03 6.84395816e+05 1.96560019e+04
0.00000000e+00 0.00000000e+00]
R-squared score (training): 0.693
R-squared score (test): 0.586
MAE:1064708.087
MSE:2472202623345.332
RMSE:1572323.956
With linear regression, the R-squared score on the test set is 0.586.
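The reported metrics can be computed from predictions as in the sketch below. It uses the standard definitions of R-squared, MAE, MSE and RMSE; the helper name `regression_report` and the salary figures are made up for illustration and are not taken from the project's data.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return R-squared, MAE, MSE and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
    }


# Hypothetical salaries in dollars, each prediction off by 500,000.
y_true = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
y_pred = [1_500_000, 1_500_000, 3_500_000, 3_500_000]
report = regression_report(y_true, y_pred)
```

Because RMSE is the square root of MSE, the two figures reported above are consistent with each other: the square root of 2472202623345.332 is approximately 1572323.956.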
3.2. Ridge Regression:
A ridge regressor is a regularized version of the linear regressor: to the original cost
function of the linear regressor we add a regularization term, which forces the learning algorithm
to fit the data while keeping the weights as small as possible. The regularization term has the
parameter ‘alpha’, which controls the strength of the regularization and helps reduce the variance
of the estimates.
Ridge regression uses the L2 regularization term: it adds the “squared magnitude” of the
coefficients as a penalty term to the loss function.
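To make the L2 penalty concrete, the sketch below assumes the common form of the ridge cost, the sum of squared errors plus alpha times the sum of squared weights, and solves it in closed form with NumPy. The function name `ridge_fit` and the toy data are illustrative assumptions, not the project's code.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution; the intercept is left unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean        # centering removes the intercept
    penalty = alpha * np.eye(X.shape[1])   # L2 term: alpha * sum(w_j ** 2)
    weights = np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)
    return weights, y_mean - x_mean @ weights


# Toy data lying exactly on the line y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w_ols, b_ols = ridge_fit(X, y, alpha=0.0)      # no penalty: ordinary least squares
w_ridge, b_ridge = ridge_fit(X, y, alpha=5.0)  # penalty shrinks the slope
```

With alpha = 0 the slope is 2, as in plain linear regression; with alpha = 5 it shrinks to 1, which shows how a larger alpha keeps the weights smaller.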
Results:
ridge regression linear model intercept: 2407475.7394133257

ridge regression linear model coeff:
[ 349618.48388308 -3903762.28757296 -876976.29763292 -172644.20982722
784949.45396927 780010.69103652 -554926.37486197 452989.53055786
1590605.83456925 -91865.87202405 665427.06625916 398674.66078557
-98306.08223311 842577.6387086 1051833.42845883 2508532.31662549
-717800.66463921 -843159.79641446 -822648.95063398 -282009.41151033
250741.30207187 226128.22773934 428088.8379438 343126.56829792
-685682.26713212 -206712.38416249 217198.45620534 -392171.54422868
118367.45814472 -21020.1874135 1660569.27087533 -168306.77471423
93159.53229209 430127.9301021 373680.65479227 -117208.57093471
1771606.17161539 169426.94395699 198083.40834724 952061.59898245
250316.83403896 -422392.87672097 -425029.81556253 78790.8480502
-631131.21457103 702789.5682402 105230.72301668 3494328.24435508
-1232711.24122053 -476749.32935716 1248706.29307909 -773472.19427283
306567.24057786 -1836806.57520068 -2781801.97367905 930534.00822902
275454.34049598 40693.38346575 835093.33059733 1034518.51168593
0. 0. ]
MAE:1012135.708
MSE:2234793657210.046
RMSE:1494922.626
Using ridge regression at alpha=1, the test score is 0.63.
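One way the alpha parameter could be tuned is a cross-validated grid search with scikit-learn. The sketch below runs on synthetic data; the candidate alpha values, the 5-fold split and the random data are assumptions for illustration and are not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: 120 samples, 10 features, mild noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=120)

# Try each candidate alpha with 5-fold cross-validation, scored by R-squared.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_score = search.best_score_
```

The same pattern applies to lasso regression by swapping `Ridge()` for `Lasso()`.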
3.3. Lasso Regression:
Lasso stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty term
to the cost function: the sum of the absolute values of the coefficients. As the coefficients
move away from 0, this term grows, which pushes the model to shrink the coefficients in order to
reduce the loss. The difference between ridge and lasso regression is that lasso tends to set some
coefficients exactly to zero, whereas ridge shrinks coefficients towards zero but generally does
not set them exactly to zero.
Lasso regression uses the L1 regularization term: it adds the “absolute value of
magnitude” of the coefficients as a penalty term to the loss function.
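The difference between the two penalties can be seen on a small example. The sketch below fits scikit-learn's `Lasso` and `Ridge` on a made-up data set with two orthogonal features, one that drives the target and one that barely matters; the data and the alpha value of 0.5 are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Two orthogonal toy features: the first drives y, the second barely matters.
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]

lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)  # L2 penalty

lasso_coef = lasso.coef_  # weak feature is set exactly to zero
ridge_coef = ridge.coef_  # weak feature is shrunk but stays non-zero
```

Lasso removes the weak feature entirely, which is why it can act as a feature selector, while ridge only shrinks both coefficients.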
Results:
lasso regression linear model intercept: -1983139.4519911231
MAE:1064500.677
MSE:2471123169556.865
RMSE:1571980.652
Using lasso regression, the test score is 0.58.
4. UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The
standard is managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form UML comprises two
major components: a meta-model and a notation. In the future, some form of method
or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of software systems, as well
as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and
the software development process. The UML uses mostly graphical notations to
express the design of software projects.
GOALS:
The primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they
can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.
4.1. USE CASE DIAGRAM:
A use case diagram in the Unified Modeling Language (UML) is a type of
behavioral diagram defined by and created from a use-case analysis. Its purpose is to
present a graphical overview of the functionality provided by a system in terms of
actors, their goals (represented as use cases), and any dependencies between those
use cases. The main purpose of a use case diagram is to show what system functions
are performed for which actor. Roles of the actors in the system can be depicted.
4.2. CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is
a type of static structure diagram that describes the structure of a system by showing
the system's classes, their attributes, operations (or methods), and the relationships
among the classes. It explains which class contains information.
4.3. SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event diagrams, event scenarios, and timing diagrams.
4.4. ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities
and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity diagram
shows the overall flow of control.
5. OUTPUT SCREENS
Conclusion
The most important part of building a model is the feature engineering and selection process.
The features should carry as much information as possible so that the model is robust
and accurate. Skill in feature selection and extraction comes with time and experience, and there
are often several ways to handle the information available in a data set. We trained our models
on a data set after performing feature engineering.
There are many different ways to train a model, and the learning algorithm should be chosen to give
the best results. Different learning algorithms can also be combined in an ensemble to
make the model more robust.