Abstract
This paper analyzes an MLB Hall of Fame (HOF) database of offensive and defensive statistics, and associated sabermetrics, for position players (eligible as of 2000) to determine which data mining algorithms and which data combinations produce the best identification of players likely to be selected for the baseball HOF. The paper investigates the use of the Naïve Bayes, J48 decision tree, and logistic regression algorithms in the Weka data mining software package. The paper also investigates the impact on classification accuracy of removing the total player rating sabermetric from the database, as well as the accuracy improvements gained by reducing the attribute set analyzed. In general, the logistic regression algorithm was found to provide the best performance; reducing the number of attributes considered provided a significant improvement in accuracy, and the inclusion of the total player rating a smaller one.
Contents

Introduction
Background
    Data Description
    Descriptive analysis
Methods
    Choice of a model
        Logistic regression
    Experimental design
        Logistic regression
Comparison with similar studies
References
Introduction
This paper analyzes the Major League Baseball (MLB) Hall of Fame (HOF) database available for the class project, containing offensive and defensive statistics, and associated sabermetrics, for position players (eligible as of 2000), to determine which data mining algorithms and data combinations produce the best identification of players selected for the baseball HOF. In addition, the paper considers the impact of the presence or absence of one particular sabermetric variable, total player rating (TPR), on the accuracy of the different data mining algorithms evaluated.
Background
The MLB Hall of Fame is intended to recognize both players who had exceptional
careers and achievements, and others associated with baseball, such as executives, managers, and
umpires who have made significant achievements and contributions to baseball. Players are
eligible for selection to the baseball Hall of Fame if they have played at least 10 seasons and
have been retired from the game for at least five years. During their initial period of eligibility, until recently a 15-year period, they are eligible for selection by the Baseball Writers' Association of America. After that period has ended, they may be selected, with no time limit, by the veterans committee, which also considers and selects all non-players for the Hall of Fame. While a player's record and playing ability, which can be measured by the statistics captured over his career, are a major factor in the guidelines for selection to the Hall of Fame, the guidelines also specify that, in addition to a player's performance and contributions to the teams he played for, his character, integrity, and sportsmanship should be considered (Election Rules, n.d.). This of course allows for an element of subjectivity on the part of those voting when evaluating whether a candidate is worthy of inclusion in the Hall of Fame.
Sabermetrics is a term used to describe the statistical analysis of the performance data collected on baseball players and games. Among its goals is to determine which are the most productive and efficient players, both overall and in specific aspects of the game. The term sabermetrics was coined by Bill James in 1980, who described it as "the search for objective knowledge about baseball" (Birnbaum, n.d.).
Data Description
The data set used was the one used by Cochran (2000) for his paper, and includes basic statistics and several sabermetric statistics on all position players eligible for selection to the baseball Hall of Fame in the year 2000. The data set contains 1340 instances with 24 attributes representing the players' offensive and defensive statistics for their careers, a categorical attribute indicating which position they played (catcher (C), first base (1), second base (2), third base (3), shortstop (S), outfield (O), or designated hitter (D)), and a class attribute indicating whether they were a member of the MLB Hall of Fame and, if so, whether they were elected by the baseball writers or the veterans committee. A key element of the data set is the sabermetric measure Total Player Rating (TPR), based on a proprietary formula developed by Thorn and Palmer which attempts to provide the best overall measure of a player's value over his career.
Descriptive analysis
Table 1 below shows the numeric attributes in the database and their ranges of values, while Figure 1 shows all of the attributes' distributions. The first 13 attributes, from seasons played to times caught stealing, are the totals for a player's career. The next four values represent commonly reported averages/percentages calculated from the career totals. The remaining values represent more advanced sabermetrics; these are weighted combinations of the player's career totals. A detailed description of how these values are calculated is in Albert's (2010) paper, with the exception of total player rating. Total player rating is a formula developed by Thorn and Palmer, which is a weighted sum of adjusted batting runs, fielding runs, and base stealing runs. The formula was adjusted annually and the results were available in their annual baseball encyclopedia, titled Total Baseball, which was last published in 2004. The data is now known as batter-fielder wins, and is only available with an ESPN Insider paid subscription (Yawdoszyn, 2006).
The data in Table 2 shows the number of players by primary position played and the number of players at that position in the Hall of Fame. This data includes the updates made to reflect players in the database who were selected for the Hall of Fame after 2000. Overall, the distribution of players by position matches the frequency one would expect, with two exceptions. First, the designated hitter position has only a small number of eligible players, since it is a relatively new non-fielding position that exists only in the American League and is frequently rotated among players on a team to give them a break from their normal fielding position. Second, catchers are relatively over-represented on the list of eligibles, at almost 19% of the database, when the expected range would be between 11 and 13%. Overall, a little over 9% of all eligible players have been selected for the Hall of Fame; however, catchers and third basemen are selected at about half the average frequency, while first basemen have the highest percentage of position players selected for the Hall of Fame. Since the American League adopted the designated hitter rule, only a small number of players who spent the majority of their careers as designated hitters have become eligible for the Hall of Fame, and none of them had been selected when the dataset was created. Since the cutoff date for this database, one player, Paul Molitor, who spent a significant part of his career as a designated hitter, has been selected for the Hall of Fame, but he spent enough time as a position player that he is assigned a fielding position in this database. Currently, there exists a perception that the Baseball Writers' Association voters downgrade career designated hitters when evaluating them as candidates for the Hall of Fame.
Since the database was created, an additional 13 players who were not included in the database have become eligible for and been selected for membership in the Hall of Fame. I created an additional test data set, to use in testing the classification models producing the best results, with these 13 players plus 39 additional players who are not in the Hall of Fame and are no longer being considered by the Baseball Writers' Association. I was able to access career total statistics and several of the sabermetric values at the baseball-reference.com website. For those data attributes not found there, I was able to use the formulas in Albert's (2010) paper to calculate the missing data, with the exception of the total player rating. I was able to get the TPR for all of the new Hall of Fame players from a blog post (Yawdoszyn, 2006) discussing Hall of Fame candidates and a few other players in terms of their TPR scores. For those sabermetric values calculated based on Albert's (2010) paper, I was able to verify the calculations by computing values for players in the database and confirming that the results were close matches, allowing for rounding errors. However, I was unable to consistently replicate the TPR calculations, especially for first basemen, shortstops, and catchers. Therefore, I had to restrict the players included in the second test data set to those players for whom I could obtain a TPR score.
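As an illustration of this kind of reconstruction, the sketch below computes a few derived statistics from hypothetical career totals. The batting average and slugging formulas are standard; the runs created formula shown is Bill James's basic version and may differ from the exact variants given in Albert (2010).

    public class DerivedStats {
        public static void main(String[] args) {
            // Hypothetical career totals for one player.
            double ab = 9000, h = 2700, doubles = 450, triples = 60, hr = 300, bb = 900;
            double singles = h - doubles - triples - hr;
            double totalBases = singles + 2 * doubles + 3 * triples + 4 * hr;

            double battingAvg = h / ab;                              // hits per at bat
            double slugging = totalBases / ab;                       // total bases per at bat
            double runsCreated = (h + bb) * totalBases / (ab + bb);  // basic runs created

            System.out.printf("AVG=%.3f SLG=%.3f RC=%.0f%n",
                              battingAvg, slugging, runsCreated);
        }
    }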
For the primary data set, instances with missing values were not deleted. The data set was updated to indicate selection for the Hall of Fame for the 10 players in the database selected for the Hall of Fame since the year 2000 for their records as players. Three other people in the database were selected for the Hall of Fame for their records as managers and were left coded as nonmembers. Players' names are string values and were removed from the data set prior to doing the analysis. There were only two categorical attributes in the data set: one for position played, and one indicating whether a player is in the Hall of Fame and, if so, how elected. For this analysis I simplified the Hall of Fame attribute to two values, indicating a member or not a member, and removed the distinction between election by the baseball writers and by the veterans committee.
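A minimal sketch of these preparation steps using the Weka API is shown below. The file name, the position of the name attribute, and the indices of the two "elected" class values are assumptions based on the description above, not the exact commands used for this project.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.MergeTwoValues;
    import weka.filters.unsupervised.attribute.Remove;

    public class PrepareHofData {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("hof2000.arff").getDataSet();

            // Drop the player-name string attribute (assumed to be the first column).
            Remove removeName = new Remove();
            removeName.setAttributeIndices("1");
            removeName.setInputFormat(data);
            data = Filter.useFilter(data, removeName);

            // Collapse the two "member" labels (elected by the writers, elected by
            // the veterans committee) into one value, leaving a two-valued class.
            // Assumes the class is the last attribute and the labels are values 1 and 2.
            MergeTwoValues merge = new MergeTwoValues();
            merge.setAttributeIndex("last");
            merge.setFirstValueIndex("1");
            merge.setSecondValueIndex("2");
            merge.setInputFormat(data);
            data = Filter.useFilter(data, merge);

            data.setClassIndex(data.numAttributes() - 1);
            System.out.println(data.numInstances() + " instances prepared.");
        }
    }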
Given that I decided to exclude any players for whom I could not obtain a TPR value, as discussed in the prior section, there were no problems with missing data for the players included in the second data set of those players eligible after 2000. I also decided to exclude those players who have been associated with steroid use in connection with the release of the Mitchell Report. There has been considerable speculation that the low number of votes received by those players in recent Hall of Fame voting is due to voters' concerns about the players' fitness for the Hall of Fame when considering integrity and sportsmanship, since the use of these performance-enhancing drugs was cheating on the part of the player (Mills & Salaga, 2011; Yawdoszyn, 2006; Young, Holland, & Weckman, 2008). For this reason I did not include any of those players in the second test data set.
Young et al. (2008) in their paper combined the first and third base positions into a category labeled corner infield, and the second base and shortstop positions into a category called middle infield. I decided it was not appropriate to group positions played this way: while players at second base and shortstop are selected to the Hall of Fame at approximately the same rate, players from third base are selected to the Hall of Fame at about half the rate of players from first base.
Methods
Choice of a model
In deciding which models to use, I did a literature review to see what other papers were available that addressed this topic and which models they used. Those papers are discussed in the comparison with similar studies section. I then ran the models from the other studies that I found interesting, along with additional models/algorithms that I was interested in exploring, against the base case data set, and compared the results to the results of running the ZeroR model against the base case. ZeroR is a simple classifier which assumes that everything is the most common class value; in this case, that everyone is not in the Hall of Fame. This helped me to eliminate options which did not appear to produce reasonably accurate results. I then narrowed the models to be used for this paper down to three classification models, Naïve Bayes, J48 decision tree, and logistic regression, in order to keep the scope at a reasonable level.
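A minimal sketch of this ZeroR baseline run in the Weka API is shown below; the ARFF file name is an assumption.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ZeroRBaseline {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("hof_base.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // ZeroR always predicts the majority class: "not in the Hall of Fame".
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new ZeroR(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }

With roughly 9% of eligible players in the Hall of Fame, this baseline scores about 91% percent correct while identifying no members at all, which is the floor any candidate algorithm must beat.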
Naïve Bayes is a statistics-based classification method that predicts the class of an instance based upon the combined probability that the values of its individual attributes indicate membership in the class. Naïve Bayes assumes that the value of each individual attribute is independent of the values of the other attributes. This is not always the case, and therefore Bayes-based classification normally works better with a reduced attribute set that eliminates redundant attributes or minimally contributing attributes. Naïve Bayes also has an advantage in that it handles missing values by simply omitting them from the calculations for that specific instance, and it handles cases of zero occurrences by using the Laplace estimator, which allows for a small but nonzero likelihood that the value could occur in the future (Han & Kamber, 2006; Witten, Frank, & Hall, 2011).
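To make the Laplace estimator concrete, the sketch below shows the smoothed conditional probability calculation; this is an illustration of the idea, not Weka's internal code, and the counts are hypothetical.

    public class LaplaceEstimate {
        // P(value | class) with Laplace smoothing: add one to the observed count
        // and add the number of possible attribute values to the denominator.
        static double laplace(int countValueInClass, int classTotal, int numValues) {
            return (countValueInClass + 1.0) / (classTotal + numValues);
        }

        public static void main(String[] args) {
            // Hypothetical counts: none of 25 Hall of Fame members in a training
            // fold play designated hitter, and "position" has 7 possible values.
            double p = laplace(0, 25, 7);
            // Prints a small nonzero probability (~0.031) instead of 0, so one
            // unseen attribute value cannot zero out the whole product.
            System.out.println("P(position = DH | HOF) = " + p);
        }
    }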
The J48 decision tree algorithm is Weka's implementation of the C4.5 decision tree induction algorithm. These are described as greedy, or non-backtracking, algorithms which build their structure based on a divide-and-conquer strategy; branch splitting, i.e., which attribute among those remaining will be used for the next level in the tree, is based on greatest information gain. For decision trees, pruning is used to reduce the risk of overfitting, reduce complexity, and increase computational efficiency. The end result is a set of rules in a tree structure that can be applied to determine which class to assign a new instance of data (Han & Kamber, 2006).
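A sketch of building a pruned J48 tree with the Weka API follows; the confidence factor and minimum leaf size shown are Weka's defaults, and the file name is an assumption.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HofTree {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("hof_base.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setConfidenceFactor(0.25f); // lower values prune more aggressively
            tree.setMinNumObj(2);            // minimum instances per leaf
            tree.buildClassifier(data);

            // Prints the induced tree; each path from root to leaf is one rule.
            System.out.println(tree);
        }
    }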
Logistic regression
Regression techniques are statistical methods that are easily adaptable for classification where there are numeric attributes. Regression techniques use the training instances to calculate the probability of class membership as a function of the value of each attribute. When evaluating a new instance, the probabilities are calculated and the class value with the largest probability is selected. Logistic regression is a generalized linear model that avoids some of the problems of linear regression. First, it does not assume the data for an attribute is normally distributed, which Figure 1 shows is not the case for several of the attributes in this data set. Second, it does not approximate the class values (0 or 1) directly, which can produce out-of-range probability values; instead it approximates the values using the logit transformation function (Han & Kamber, 2006).
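For reference, a standard statement of the model in the usual notation, where p is the probability of Hall of Fame membership and x_1, ..., x_k are the attribute values:

    \operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
    \qquad
    p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}

Because the logit maps the interval (0, 1) onto the whole real line, the fitted probabilities always stay in range, unlike a direct linear approximation of the 0/1 class values.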
Experimental design
In conducting the tests there were two objectives: first, to determine the relative effectiveness of the three different classification systems examined, and second, to evaluate the effect of removing the total player rating (TPR) score from the data set and the benefit of reducing the number of attributes in the data set. In these experiments each model was evaluated against four data sets, as shown in Table 3. One immediate impact of removing the TPR from the database was that the reduced attribute set grew from 7 to 10 attributes. The Weka Explorer function, using 10-fold cross-validation, was used to train and evaluate each classification method against each of the data sets. The evaluation model was saved and then used on the second test data set containing players eligible since 2000. The Weka Experimenter function was also used to create an experiment evaluating the relative utility of the four data sets fed to the three classification algorithms.
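The sketch below outlines this pipeline in the Weka API: 10-fold cross-validation on the training set, saving the trained model, and re-evaluating the saved model on the second test set. The file and model names are assumptions, and logistic regression stands in for whichever classifier is being evaluated.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HofExperiment {
        public static void main(String[] args) throws Exception {
            Instances train = new DataSource("hof_base.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);

            // 10-fold cross-validation, as in the Weka Explorer runs.
            Logistic model = new Logistic();
            Evaluation cv = new Evaluation(train);
            cv.crossValidateModel(model, train, 10, new Random(1));
            System.out.println(cv.toSummaryString());
            System.out.println(cv.toMatrixString());

            // Train on the full set, save the model, then score the post-2000 players.
            model.buildClassifier(train);
            SerializationHelper.write("logistic.model", model);

            Classifier saved = (Classifier) SerializationHelper.read("logistic.model");
            Instances test = new DataSource("hof_post2000.arff").getDataSet();
            test.setClassIndex(test.numAttributes() - 1);
            Evaluation holdout = new Evaluation(train);
            holdout.evaluateModel(saved, test);
            System.out.println(holdout.toMatrixString());
        }
    }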
The summary output, accuracy results, and confusion matrix results for the Naïve Bayes runs are shown in Table 5. In general, the Naïve Bayes classification appears to favor accurately classifying a high percentage of HOF members at the expense of including a high number of false positives (classifying non-HOF members as members). This trend was evident in both the cross-validation results and the results against the second test data set.
Table 6 shows the t-test results (.05 level of confidence) for selected measures of effectiveness (MOEs), using the Weka Experimenter function to determine whether the different data sets had an impact on the quality of the predictions made. The table shows that the results achieved with the two reduced attribute data sets were significantly better than the performance for the base case, and that removing the TPR from the data set produced significantly worse results.
The Weka Experimenter rank results are shown in Table 7. The ranking function showed that, when using the Naïve Bayes classification, there was a significant difference in the percent correct and F-measure results for each of the four data sets. The results show that a reduced attribute set is most important and that the TPR ranking also provides significant information.
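For reference, the F-measure reported by Weka is the harmonic mean of precision P and recall R, computed here for the Hall of Fame member class:

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F = \frac{2PR}{P + R}

Because Hall of Fame members are rare in this data, the F-measure is more informative than raw percent correct, which the ZeroR baseline already pushes above 90%.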
Table 8 below shows the summary output, accuracy results, and confusion matrix results for the J48 decision tree algorithm. The J48 algorithm produced trees of size 13 to 17 with 7 to 9 leaves. Unlike the Naïve Bayes algorithm, the J48 decision tree algorithm tended to produce confusion matrices with a smaller percentage of Hall of Fame members correctly identified, but also with a significantly lower number of false positives (nonmembers identified as members). When the J48 models for each of the data sets were run against the test data set, J48 tended to do better than Naïve Bayes, correctly predicting a high percentage of Hall of Fame members with a low number of false positives. The one exception was for the base case data set with the TPR attribute removed; in that particular instance, the model only identified half of the Hall of Fame members.
The t-test results, using the Weka Experimenter function for the J48 decision tree, show there were no significant differences in the MOEs across the different data sets used in this experiment. In addition, there was no rank differentiation between the four data sets for this algorithm.
Logistic regression
Logistic regression results are shown in Table 10 below. As was the case with the J48 decision tree, on the training data the logistic regression classification developed a classification scheme that tended toward fewer false positives at the expense of more false negatives, relative to the Naïve Bayes classification scheme. When run against the test data set, as the confusion matrices show, the logistic regression models had a very high accuracy rate, correctly identifying all but one of the Hall of Fame members with no false positives, with one exception: the classification model developed from the reduced attribute set when total player rating was not available had no false positives, but only correctly identified eight out of 13 Hall of Fame members.
The results from the Weka Experimenter function, in Table 11 below, show that there was no significant difference in the MOEs across the different data sets when the base case data set was used as the point of reference; that is, the results from the other data sets were not significantly better or worse than the results achieved with the base case.
However, as Table 12 shows, when the t-test was used to rank the four different data sets, the reduced attribute data set was ranked first and the base case without the TPR attribute was ranked last. The reduced attribute data set was significantly better than the base case with the TPR attribute removed, while there was no significant difference between it and the other two data sets, or between them and the base case with TPR removed. This held when the data sets were evaluated on the percent correct, F-measure, and area under ROC MOEs.
As an initial comparison of the three models, I did one comparison using the Weka Experimenter environment where I compared the three models with the ZeroR algorithm as a base case. The ZeroR algorithm uses a rule that assigns all entities to the most common class value, and serves as a minimum standard which any algorithm under consideration should exceed. Table 13 below shows the ranked results for the four algorithms with the base case data set, assessed for the F-measure and area under ROC MOEs. The percent correct ranking was similar, except that the positions of the J48 decision tree and Naïve Bayes were reversed. This demonstrates that all three of the algorithms produced significantly better results than the ZeroR baseline.
Table 13. Weka classification algorithm ranking with ZeroR base case
The ROC visual threshold curves for these four algorithms are shown in Figure 2 below.
The ZeroR results are in the upper left and show a straight 45° line, reflecting the low quality of its simplistic classification rule. The curves for the Naïve Bayes and the J48 decision tree functions are almost identical, as are their area under ROC values, while that for the logistic regression is not quite as good, reflecting its tendency to accept more false negatives in order to minimize false positives.
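The curve data behind a figure like this can be extracted with Weka's ThresholdCurve class. The sketch below assumes an Evaluation object produced as in the earlier sketches, a recent Weka version in which predictions() returns a list, and that the Hall of Fame member class has index 0.

    import weka.classifiers.Evaluation;
    import weka.classifiers.evaluation.ThresholdCurve;
    import weka.core.Instances;

    public class RocCurve {
        // Prints the area under the ROC curve for the class at index 0.
        static void printRocArea(Evaluation eval) throws Exception {
            ThresholdCurve tc = new ThresholdCurve();
            Instances curve = tc.getCurve(eval.predictions(), 0); // one row per threshold
            System.out.println("Area under ROC: " + ThresholdCurve.getROCArea(curve));
        }
    }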
Using the Weka Experimenter, I then evaluated the three classification algorithms against each of the four data sets and did a combined evaluation of the results. All evaluations used the corrected t-test algorithm in Weka at the .05 level of confidence. Focusing on the models first, the results below were developed using the J48 decision tree algorithm as the base for comparison. This was selected in order to show maximum discrimination between the three algorithms. Tables 14, 15, and 16 show the results for the three primary MOEs: percent correct, F-measure, and area under ROC. The results of the ranking algorithm are shown in Tables 17 (percent correct and F-measure) and 18 (area under ROC).
The overall results showed that the logistic regression algorithm performed best, followed by the J48 decision tree and then Naïve Bayes. However, when area under ROC is considered, the Naïve Bayes algorithm performs better than the J48 decision tree algorithm.
Table 17. Weka classification algorithm ranking, F-measure and percent correct
The relative utility of the data sets used has been discussed as part of the discussion for each algorithm. As one would expect, the combined results, shown in Table 18, are in line with the individual results. Using the Weka ranking function, with a t-test confidence level of .05, there was a clear and distinct ranking of the data sets, as each data set was significantly different from the others. Overall, using a reduced attribute data set was more important than the presence or absence of the total player rating; however, having TPR in a data set was significantly better than not having it.

Table 18. Weka data set ranking for F-measure (all classification algorithms)
Comparison with similar studies

Braun, Hartz, Leyhane, and McGee (2006) in their paper used data that included the number of times a player was selected as an All-Star and the number of awards a player received for leading in statistical categories, but they did not consider data on base running and fielding performance. They also ultimately decided to only consider players in the post-World War II era since, as they note, some argue that many of the players selected from the earlier era are essentially mistakes that do not deserve to be in the Hall of Fame based on their career statistics (p. 3). In their initial tests, they used Naïve Bayes, JRip, and random forest classifiers. However, they dropped Naïve Bayes and used the other two both individually and with meta-classifiers, AdaBoost with JRip and GainRatio with random forest. Their results used the F-measure as their primary measure of effectiveness and produced values of 0.72 for JRip and 0.75 for random forest, which were comparable to my results using J48 and the full attribute data set.
Young et al. (2008) investigated the use of neural networks to forecast Hall of Fame selection for position players. The attributes considered included player position, basic career batting totals, base running totals, and fielding totals, plus the total performance and character awards received by the player. They also excluded designated hitters due to their lack of fielding statistics. They used an unsupervised K-means algorithm, which initially produced 10 clusters, approximately half of which contained Hall of Fame players. They reported, after considerable testing, that they were able to achieve approximately a 98% accuracy rate for classifying players as to whether they were in the Hall of Fame or not. This was an accuracy level that I only achieved against the test data set, with the J48 decision tree and logistic regression models, and did not match in the cross-validation results.
Mills and Salaga (2011) used the random forest classification algorithm to forecast the probability of Hall of Fame induction for current and recently retired players who have played for at least 10 years. They used data available from baseball-reference.com's subscription database to extract what they consider to be the traditional batting and baserunning career totals and averages, along with the total number of times a player was selected as an All-Star. The players were divided into two data sets: first, a training data set of all players who retired after 1950 and were eligible for or in the Hall of Fame, and second, a test data set consisting of all players who retired after 1989 and were not in the Hall of Fame as of 2009. They observed that home runs and total All-Star selections were two of the most important attributes in forecasting Hall of Fame selection. They ran multiple models, all of which had a very low misclassification rate for identifying non-Hall of Fame players. However, the models misclassified between 10% and 23% of the Hall of Fame players as non-Hall of Fame, with an overall out-of-bag (OOB) error rate of 1% to 2.6%. These results were similar to the results achieved with the J48 decision tree and the logistic regression models.
The analysis conducted as part of this project, as well as the other related studies, shows that selecting a Hall of Fame quality player is not a clear-cut, statistics-based decision; however, high accuracies can be achieved. The experiments and analysis showed that the logistic regression algorithm performed better than the other two algorithms over the full range of cases. In general, the J48 decision tree performed better than the Naïve Bayes algorithm; however, under some measures Naïve Bayes performed better. One issue for consideration in deciding whether Naïve Bayes or the J48 decision tree would be preferred could depend upon which standard one might wish to apply. Specifically, is it more important to capture a high percentage of Hall of Fame members correctly while accepting a high number of false positives, which could be equated to identifying people not in the Hall of Fame who should be considered? This reflects the Naïve Bayes classification process. Alternatively, as reflected by the J48 decision tree, would it be better to have a more restrictive process, which perhaps identifies only those Hall of Fame members who could be considered clear-cut selections?
In addition to addressing the advantage of reducing the number of attributes, the paper also looked at the value of TPR as an example of a more complex sabermetric score. The net conclusion is that a reduced attribute set is more important for improving accuracy; however, including the TPR in the data set also provided a significant benefit.
The database that was used for this study did not include metrics that some of the other studies did, such as the number of achievement awards, which play a significant role in the selection process (Mills & Salaga, 2011; Young et al., 2008). If that data were available, it would be worth adding to the analysis.
The conduct of this project proved to be a good experience in learning the capabilities of the Weka data mining system and several of the algorithms that it uses. One area where I would like to devote more time is working with the clustering algorithms. I did some initial work with them but could not extract the statistical data I wanted to use to more fully understand the results. I need to work more with Weka to determine whether this was simply my lack of knowledge of Weka's capabilities, or a limitation of the tool.
References
Albert, J. (2010). Sabermetrics: The Past, the Present, and the Future. Mathematics and Sports.

Baseball-Reference.com. (n.d.). Players. Retrieved from http://www.baseball-reference.com/players/

Birnbaum, P. (n.d.). A Guide to Sabermetrics. Society for American Baseball Research. Retrieved from http://sabr.org/sabermetrics

Braun, K., Hartz, B., Leyhane, J., & McGee, D. (2006). Determining a Baseball Hall of Fame

Cochran, J. (2000). Career records for all modern position players eligible for the major league baseball Hall of Fame [Data set]. Journal of Statistics Education, 8(2). Retrieved from http://www.amstat.org/PUBLICATIONS/JSE/secure/v8n2/datasets.cochran.new.cfm

Election Rules. (n.d.). BBWAA Rules for Election. National Baseball Hall of Fame. Retrieved from http://baseballhall.org/hall-of-famers/bbwaa-rules-for-election

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann.

Mills, B. M., & Salaga, S. (2011). Using tree ensembles to analyze National Baseball Hall of Fame voting patterns. Journal of Quantitative Analysis in Sports. Retrieved from http://www.brianmmills.com/uploads/2/3/9/3/23936510/1mills__salaga_-_random_forest_baseball_hof_jqas_2011.pdf

Witten, I., Frank, E., & Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.). Burlington, MA: Morgan Kaufmann.

Yawdoszyn, M. (2006). Hall of Famers' Sabermetrics Rankings [Blog post]. Retrieved from http://mysite.verizon.net/vze2x57w/sabermetrics/id1.html

Young, W. A., Holland, W. S., & Weckman, G. R. (2008). Determining hall of fame status for major league baseball using an artificial neural network. Journal of Quantitative Analysis in Sports.