Explanation For Project Major League Baseball
Group (9)
Student Name ID
Puja Chakraborty 110120227
Binny Kaur 110114712
Jothi Prakash Murugan 110117643
Suhani Prajapati 110119218
Deepack Ravichandran 110119595
Team Members’ Contributions
1. Puja Chakraborty
a. Continuous Probability Distribution
b. Sampling Methods and Central Limit Theorem
c. Estimations and Confidence Intervals
d. Compiling the final work including alignment, word file processing
and work validation.
2. Binny Kaur
a. Creating Files and folder, sharing those among group members and
refining initial data.
b. Introduction to Statistics
c. Frequency Tables, Distributions and Graphs
d. Numerical Measures
e. Displaying and Exploring Data
3. Jothi Prakash Murugan
a. Steps of Hypothesis Testing
b. Nonparametric methods: Nominal Level Hypothesis
c. Correlation and Linear Regression
4. Suhani Prajapati
a. A Survey of Probability Concepts
b. Discrete Probability Distribution
5. Deepack Ravichandran
a. Steps of Hypothesis Testing
b. Nonparametric methods: Nominal Level Hypothesis
c. Correlation and Linear Regression
Project Major League Baseball
1. Introduction to Statistics
Consider the following variables: number of wins, payroll, season attendance, whether the
team is in the American or National League, and the number of home runs hit.
a. Which of these variables are quantitative and which are qualitative?
Variable                                            Type
Number of Wins                                      Quantitative
Payroll                                             Quantitative
Season Attendance                                   Quantitative
Team classification (American or National League)   Qualitative
Number of home runs hit                             Quantitative
3. Numerical Measures
Refer to the team salary variable, include the answer to the following questions in your
report.
a. Around what values do the data tend to cluster? Specifically, what is the mean team
salary? What is the median team salary? Is one measure more representative of the typical
team salary than the others?
Mean Team Salary = Total Team Salary / Number of Teams = 3162.18/30 = $105.4 million
The median team salary is the middle value when the data are arranged in ascending order.
Since there is an even number of teams (30), we take the average of the two middle values. As a result,
Median Team Salary = $100.43 million
The data tend to cluster around the mean of $105.4 million; the median team salary is $100.43 million.
We consider the mean the more representative measure of team salary, as more values lie close to it.
b. What is the range of the team salaries? What is the standard deviation? About 95% of the
salaries are between what two values?
Range = Maximum Value – Minimum Value
Maximum Salary = $194.08 million (Team Chicago Cubs)
Minimum Salary = $49.08 million (Tampa Bay Rays)
Range of Team Salaries = $194.08 - $49.08 = $145 million
Standard Deviation = $36.73 million (treating the given data as a population)
By the Empirical Rule, about 95% of the salaries lie within two standard deviations of the mean,
that is, between $31.93 million and $178.88 million.
Lower Limit = Mean – 2 × Standard Deviation = $31.93 million
Upper Limit = Mean + 2 × Standard Deviation = $178.88 million
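The two limits above can be checked with a short calculation. This is only a sketch: the variable and function names are illustrative, and it uses the rounded mean and standard deviation quoted in this report, so the last decimal may differ slightly from the figures above.

```python
# Values quoted in the report, in $ millions (rounded, so results are approximate).
mean_salary = 105.406
population_sd = 36.73


def empirical_rule_interval(mean, sd, k=2):
    """Return (lower, upper) = mean minus/plus k standard deviations."""
    return mean - k * sd, mean + k * sd


# k = 2 corresponds to the "about 95%" band of the Empirical Rule.
lower, upper = empirical_rule_interval(mean_salary, population_sd)
print(f"About 95% of salaries lie between {lower:.2f} and {upper:.2f} million")
```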
4. Displaying and Exploring Data
a. In the data set, the year opened is the first year of operation for that stadium. For each
team, use this variable to create a new variable, stadium age, by subtracting the value of the
variable year opened from the current year. Develop a box plot with the new variable,
stadium age. Are there any outliers? If so, which of the stadiums are outliers?
Please refer to Box Plot for new variable, Stadium Age.
b. Using the variable salary create a box plot. Are there any outliers? Compute the quartiles.
Write a summary of your analysis.
Please refer to the box plot for the variable, Salary. Salaries range from about $49 million to
$194 million, with a median of about $100 million. The middle 50% of the salaries fall within the
interquartile range, and there are no outliers for the variable salary.
Box Plot Variables for Salary
Minimum Value 49.08
Quartile 1 80.21
Median (Quartile 2) 100.43
Quartile 3 126.82
Maximum Value 194.08
Mean 105.41
Interquartile Range 46.60
Lower Outlier Limit 10.31
Upper Outlier Limit 196.72
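The outlier limits in the table follow the usual box plot rule of 1.5 times the interquartile range beyond each quartile. The sketch below recomputes them from the quartiles listed above; the names are illustrative, and small differences in the second decimal come from rounding the interquartile range.

```python
# Quartiles and extremes taken from the table above, in $ millions.
q1, q3 = 80.21, 126.82
min_salary, max_salary = 49.08, 194.08


def outlier_limits(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower_limit, upper_limit = outlier_limits(q1, q3)

# A value is an outlier if it falls outside the limits.
has_outliers = min_salary < lower_limit or max_salary > upper_limit
print(f"Limits: {lower_limit:.2f} to {upper_limit:.2f}; outliers present: {has_outliers}")
```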
c. Draw a scatter diagram with the variable wins on the vertical axis and salary on the
horizontal axis. Compute the correlation coefficient between wins and salary. What are your
conclusions?
Please refer to scatter diagram.
[Scatter diagram: Wins (vertical axis) versus Salary (horizontal axis)]
We must determine the percentage of teams that won 90 or more games in order to determine the
likelihood that a team will win 90 or more games. Out of the total of 30 teams, 8 teams in this
instance won 90 or more games, so the likelihood that a team will win 90 or more games is
2. In the playoffs, only 10 teams can enter the playoffs. Based on the 2022 season, what is the
probability that a team that wins 90 or more games makes the playoffs?
We know that 10 teams won 90 or more games. To determine the likelihood that a team that
wins 90 or more games makes the playoffs, based on the 2022 season, we use:
$$\text{Probability} = \frac{\text{Total teams that won 90 or more games}}{\text{Total teams that can enter the playoffs}} = \frac{10}{10} = 1 \text{ or } 100\%$$
A team that wins 90 or more games will therefore have a 100% chance of making the playoffs in
the 2022 season.
3. Make a statement based on your responses to parts (1) and (2).
(1) Only 33% of the teams in the 2022 season had 90 or more wins, so fewer teams reach that
mark than the common view that 90 or more wins are required to make the playoffs might suggest.
(2) At 100%, a team that won 90 or more games was certain to make the playoffs in this season.
This indicates that winning 90 or more games is a reliable predictor of making the playoffs.
b. Presently the National League requires that all fielding players, including pitchers, take a
turn to bat. In the American League, teams can use a designated hitter (DH) to take the
pitcher's turn to bat. For each league, create a frequency distribution and a relative
frequency distribution of teams based on the season total of home runs. For the frequency
distributions, start the first class at 140 home runs and use a class interval of 30.
Frequency distribution of Home Runs in the American League:
Home Runs     Number of Teams   Relative Frequency
140-169       2                 13.33%
170-199       1                 6.67%
200-229       5                 33.33%
230-259       4                 26.67%
260-289       1                 6.67%
290-319       2                 13.33%
Grand Total   15                100%
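The relative frequencies in the table are each class count divided by the total number of teams. A minimal sketch of that calculation, using the American League counts above (the variable names are illustrative):

```python
# Class lower limits start at 140 with a class interval of 30, as the question specifies.
class_lower_limits = [140, 170, 200, 230, 260, 290]
counts = [2, 1, 5, 4, 1, 2]  # teams per class, from the table above
class_width = 30

total = sum(counts)

# Build (class label, frequency, relative frequency) rows.
rows = []
for lower, count in zip(class_lower_limits, counts):
    label = f"{lower}-{lower + class_width - 1}"
    rows.append((label, count, count / total))

for label, count, rel in rows:
    print(f"{label:<10}{count:>3}{rel:>10.2%}")
print(f"{'Total':<10}{total:>3}{1:>10.2%}")
```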
2. In the National League, what is the probability that a team hits 200 or more home runs?
The probability that a team hits 200 or more home runs in the National League is
$$\frac{\text{Total teams with 200 or more home runs}}{\text{Total number of teams}} = \frac{12}{15} = 0.8$$
a. For the variable salary, compute the mean, median, range, standard deviation, and
coefficient of skewness. Also, make a box plot for the variable, salary. Does it seem reasonable
that salary is normally distributed? Explain.
For the salary variable, the mean is 105.406, the median is 100.43, the range is 145.00, and the
standard deviation is 37.364 (treating the data as a sample; all values in $ millions). The
coefficient of skewness is 0.39953, which indicates that the distribution is approximately
symmetric with a slight positive skew.
In the box plot, the median is closer to Q1 than to Q3, which indicates a slightly right-skewed
(positively skewed) distribution. Therefore, based on the calculations and the box plot, it
seems reasonable to conclude that the salary variable is not normally distributed.
b. Compute a new variable, stadium age, by subtracting the year the stadium was built from
2020. For the variable stadium age, compute the mean, median, range, standard deviation,
and coefficient of skewness. Also, make a box plot for the variable, stadium age. Does it seem
reasonable that stadium age is normally distributed? Explain.
Based on the calculations, the mean stadium age is 29.37 years, with a median of 20.5 years. The
stadium ages span from 3 to 108 years, a range of 105 years, with a standard deviation of 25.12 years.
The coefficient of skewness is positive, with a value of 1.06, indicating that the distribution is
highly skewed to the right.
The box plot shows the same right-skewed (positively skewed) shape, as the median is closer to
Q1 than to Q3.
So, it can be concluded that the variable stadium age is not normally distributed.
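The skewness values quoted for salary and stadium age appear consistent with Pearson's coefficient of skewness, 3 × (mean − median) / standard deviation. The sketch below assumes that is the formula used and recomputes both values from the figures reported above; the function name is illustrative.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's coefficient of skewness: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


# Inputs are the summary statistics reported in the text above.
salary_sk = pearson_skewness(mean=105.406, median=100.43, sd=37.364)
stadium_age_sk = pearson_skewness(mean=29.37, median=20.5, sd=25.12)

print(f"Salary skewness:      {salary_sk:.5f}")
print(f"Stadium age skewness: {stadium_age_sk:.2f}")
```

A positive value means the mean exceeds the median, which is the pattern expected for a right-skewed distribution.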
Therefore, the likelihood of a sample mean attendance as large or larger than 2.283 from the
population is approximately 0.0985 or 9.85%.
b. Using a 5% significance level, conduct a test of the hypothesis to determine whether the
mean attendance was more than 2 million per team.
To determine whether the mean attendance of the team was more than 2 million, we first state the
null and alternate hypothesis.
Step 1: Stating Null and Alternate Hypothesis
H0: μ ≤ 2 million
H1 : μ > 2 million
Step 2: Selecting Level of Significance
The level of significance α is given in the question:
α = 0.05
Step 3: Select the test Statistic.
Because the population standard deviation is not known, we select the test statistic t, where
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Step 4: Formulate the Decision rule.
Step 1 shows that this is a one-tailed (right-tailed) test. With the level of significance
α = 0.05 and the degrees of freedom given by
Degrees of freedom = n − 1 = 30 − 1 = 29
the t distribution table gives a critical value of 1.699.
Decision Rule: If the computed value of t is greater than 1.699, reject the null
hypothesis. If t is less than or equal to 1.699, do not reject the null hypothesis.
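Since the alternate hypothesis is μ > 2 million, the rejection region lies in the right tail only. The sketch below shows how the test statistic and the decision could be computed once the sample mean and sample standard deviation are available. The function names are illustrative, and the default critical value 1.699 is the table value for α = 0.05 with 29 degrees of freedom.

```python
import math


def t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """Compute t = (x_bar - mu) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (sample_sd / math.sqrt(n))


def reject_null_right_tailed(t_value, critical_value=1.699):
    """Right-tailed rule: reject H0 only when t exceeds the critical value."""
    return t_value > critical_value
```

To use it, pass the sample mean attendance, the hypothesized mean (2 million), the sample standard deviation, and n = 30 to `t_statistic`, then pass the result to `reject_null_right_tailed`.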
To determine whether there is a relationship between salaries and winning, we can use a
contingency table, which lets us test whether two traits or characteristics are related.
From the raw data, we can establish a contingency table using the conditions given in the above
question.
Contingency Table
High Salary Low Salary Total
Winning 9 7 16
Not Winning 6 8 14
Total 15 15 30
Observed (fo) and expected (fe) frequencies
              High Salary      Low Salary
              fo      fe       fo      fe
Winning       9       8        7       8
Not Winning   6       7        8       7
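Each expected frequency is (row total × column total) / grand total, and the chi-square statistic sums (fo − fe)² / fe over the four cells. The sketch below recomputes both from the contingency table. The variable names are illustrative, and the critical value 3.841 assumes a .05 significance level with 1 degree of freedom, which is not stated in this part of the report.

```python
# Observed frequencies: rows = Winning, Not Winning; columns = High Salary, Low Salary.
observed = [[9, 7],
            [6, 8]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (fo - fe)^2 / fe over all cells.
chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)

CRITICAL_VALUE = 3.841  # assumed: alpha = .05, df = (2 - 1) * (2 - 1) = 1
print(f"Expected: {expected}")
print(f"Chi-square = {chi_square:.4f}; reject H0: {chi_square > CRITICAL_VALUE}")
```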
[Scatter diagram: Attendance (vertical axis) versus Salary (horizontal axis)]
The plot shows a direct relationship between attendance and team salary. The correlation
coefficient is 0.4037, which can be considered a moderate positive correlation.
b. What is the expected attendance for a team with a salary of $100.0 million?
The regression equation is
y = 12960x + 917096
where x is the team salary in $ millions. For a team with a salary of $100 million:
y = 12960(100) + 917096
y = 1296000 + 917096
y = 2,213,096
The expected attendance for the team is 2,213,096.
c. If the owners pay an additional $30 million, how many more people could they expect to
attend?
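For an additional $30 million, the total salary will be $130 million:
y = 12960(130) + 917096
y = 1684800 + 917096
y = 2,601,896
The expected attendance at a salary of $130 million is 2,601,896. Compared with 12960(100) +
917096 = 2,213,096 at $100 million, the owners could expect 30 × 12960 = 388,800 more people to attend.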
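The two predictions, and the difference between them, can be reproduced directly from the regression equation y = 12960x + 917096. A minimal sketch, with illustrative names and salary expressed in $ millions:

```python
SLOPE = 12960        # additional attendees per extra $1 million of salary
INTERCEPT = 917096   # predicted attendance at a salary of zero


def expected_attendance(salary_millions):
    """Predict season attendance from team salary (in $ millions)."""
    return SLOPE * salary_millions + INTERCEPT


base = expected_attendance(100)
raised = expected_attendance(130)
extra = raised - base  # equals 30 * SLOPE, independent of the starting salary

print(f"At $100 million: {base:,}")
print(f"At $130 million: {raised:,}")
print(f"Additional attendance for $30 million more: {extra:,}")
```

Because the model is linear, the additional attendance depends only on the slope and the size of the salary increase.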
d. At the .05 significance level, can we conclude that the slope of the regression line is
positive? Conduct the appropriate test of the hypothesis.
f. Determine the correlation between attendance and team batting average and between
attendance and team ERA. Which is stronger? Conduct an appropriate test of the hypothesis
for each set of variables.
From the above one-tailed test, there is insufficient evidence to reject the null hypothesis H0.
Hence, there is a negative correlation between team ERA and attendance.