April 2, 2014
Baseball Data Analysis Project
In statistics, a sample is frequently collected from a population and analyzed in order to draw conclusions about that entire population. This project has several parts that use both categorical and quantitative variables, but its main purpose is to analyze several samples and see how they compare to the entire population. Do our sample statistics accurately predict the population statistics? That is the question we answer throughout this project. We chose a data set of baseball statistics that includes positions, home runs, batting averages, and more. For the first part of the project we chose a categorical variable, the primary player position. We drew two samples from the data, one using the systematic sampling method and one using the random sampling method. Below are graphs for the entire population of 1,341 baseball players and for each sampling method, followed by a short analysis of our findings.
ENTIRE POPULATION: Positions played in baseball
[Figure: Entire Data Population for all players. Categories charted: Outfield, Catcher, Shortstop, 2nd Base, 3rd Base, 1st Base, Designated Hitter. Axes: Frequency, Cumulative Percentage. Bar values, largest to smallest: 492, 254, 154, 148, 145, 139, 8.]
SAMPLING METHOD 1: SYSTEMATIC sampling of positions played in baseball
[Figure: Systematic Sampling of 36 players. Categories charted: Outfield, Catcher, 1st Base, Shortstop, 3rd Base, 2nd Base, Designated Hitter. Axes: Frequency, Cumulative Percentage. Bar values, largest to smallest: 17, 6, 5, 4, 2, 2, 0.]
SAMPLING METHOD 2: RANDOM sampling of positions played in baseball
[Figure: Random Sampling of 36 players. Bar values, largest to smallest: 8, 8, 6, 6, 5, 3, 0. Categories charted: Catcher, 3rd Base, Outfield, Shortstop, 1st Base, 2nd Base, Designated Hitter. Axes: Frequency, Cumulative Percentage.]
Reflection
The two sampling methods used to create the samples are the Systematic Sampling Method and the Random Sampling Method. A systematic sample is created by assigning every baseball player in the population a unique number in chronological order and then taking every nth player. In this case we had 1,341 players, so we assigned each player a number from 1 to 1,341. We wanted a sample of 36 players, so we took every 37th player to create the sample. The random sample was created exactly as it sounds: 36 players were selected at random. Both samples were created with the MOD and RAND formulas in Microsoft Excel.
Both sampling methods were fairly accurate in predicting the population statistics. In the population, Outfield is the most common primary position, and it was also the most common in both samples, although in the random sample it tied with 3rd Base. Outfielders make up 492 of 1,341 players (about 37%) in the population, 17 of 36 (about 47%) in the systematic sample, and 8 of 36 (about 22%) in the random sample, so the systematic sample resembled the population slightly better than the random sample did. The population contains only 8 Designated Hitters. Both samples mirrored that small percentage by containing no Designated Hitters. Although both samples resembled the entire population, a visual comparison of the graphs shows that the Systematic Sampling Method did a better job of predicting the population as a whole.
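The two selection procedures described in the Reflection can be sketched in Python. This is only an illustration, not the Excel workbook used for the project. The function names, the seed value, and the starting point of the systematic sample (the report does not say which player was picked first, so the sketch keeps the player numbers that a MOD test against 37 would flag) are all assumptions.

```python
import random

POPULATION_SIZE = 1341  # number of players stated in the report
SAMPLE_SIZE = 36        # sample size stated in the report
STEP = 37               # "every 37th player", as stated in the report


def systematic_sample(population_size, step):
    """Return the player numbers that are exact multiples of `step`.

    Assumption: selection starts at player number `step`, mirroring a
    MOD(number, step) = 0 test. The report does not state the start point.
    """
    return [n for n in range(1, population_size + 1) if n % step == 0]


def simple_random_sample(population_size, sample_size, seed=None):
    """Return `sample_size` distinct player numbers chosen at random."""
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


systematic_ids = systematic_sample(POPULATION_SIZE, STEP)
random_ids = simple_random_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=2014)

# Player numbers 37, 74, ..., 1332 give exactly 36 players.
print(len(systematic_ids), systematic_ids[:3], systematic_ids[-1])
print(len(random_ids))
```

With 1,341 players and a step of 37, the systematic rule yields exactly 36 players because 37 x 36 = 1,332 is the last multiple of 37 that does not exceed 1,341.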