LAB 2 - Group 20


COURSE: STAT 151

SECTION: D1

LAB #: 2

GROUP #: 20

GROUP LEADER: Ayaan Mallick

GROUP MEMBER(S): Anant Gupta, Ayaan Mallick, Estefania Santos

PLEASE FILL OUT THIS SECTION ONLY IF YOUR GROUP MEMBER(S) MISSED ANY OF THE DEADLINES:

MEMBER(S) EXCLUDED FROM THIS LAB | WHICH DEADLINE IS MISSED

No members excluded from the lab

SUMMARY OF ISSUES (FURTHER EXPLANATION OF WHY THE ABOVE GROUP MEMBER(S) IS(ARE) EXCLUDED FROM THIS LAB REPORT):

No deadlines missed
1) Suppose the amount of cola dispensed by a filling machine follows a normal
distribution with a mean (μ) and a standard deviation (σ). Select the Distributions
option in the R commander menu and then the Normal distribution among
continuous distributions options. This allows you to obtain a graph of the normal
density function, and to calculate normal probabilities when the parameters (μ and
σ) are provided. Use R commander to answer the following questions. (Hint:
Numerical answers for parts (b) and (c) should be rounded to three decimal places.)
a) Assume that the mean amount dispensed by the machine is set at μ = 8 oz.
Describe what happens to the percentage of underfilled bottles (the bottles
containing less than 8 oz) when σ decreases or increases? In general, how
does the magnitude of the standard deviation affect the filling process?
Ans. The area under the probability density function always equals 1. The
standard deviation measures the spread, or dispersion, of the distribution. As
the standard deviation increases, the peak of the density curve gets lower and
the curve gets wider, so the amounts dispensed are more spread out around the
mean and more bottles contain amounts far from 8 oz. As the standard deviation
decreases, the amounts cluster more tightly around the mean. If the standard
deviation were equal to zero, there would be no variability and every bottle
would contain exactly 8 oz. The standard deviation cannot be negative.

As for the probability that a bottle is underfilled, from the graphs we can
conclude that it remains at 0.5 (50%) no matter how the standard deviation
changes: the normal curve is symmetric about its mean, and the mean is set at
exactly 8 oz.
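
The 50% result can be cross-checked outside R commander. The following is a minimal Python sketch using the standard library's NormalDist; the σ values in the list are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

mu = 8.0                        # mean set at 8 oz, as in part (a)
sigmas = [0.02, 0.05, 0.1, 0.5]  # hypothetical standard deviations (oz)

# P(X < 8) for each sigma: the area to the left of the mean
underfilled = {s: NormalDist(mu, s).cdf(8.0) for s in sigmas}

for s, p in underfilled.items():
    print(f"sigma = {s}: P(X < 8) = {p:.3f}")
```

Every σ gives the same probability because the cutoff (8 oz) coincides with the mean, so exactly half of the area lies below it.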

b) Now assume that the mean amount dispensed by the machine is set at μ = 8.1
oz. Enter the value of σ as 0.1 oz. Calculate the percentage of underfilled
bottles (the bottles containing less than 8 oz) in this case. What is the
percentage of underfilled bottles if σ were 0.05 oz and 0.04 oz? In general,
what is the effect of decreasing σ on the percentage of underfilled bottles?

Ans. Percentage when μ = 8.1 and σ = 0.1: 0.1586553 x 100 ≈ 15.866%
Percentage when μ = 8.1 and σ = 0.05: 0.02275013 x 100 ≈ 2.275%
Percentage when μ = 8.1 and σ = 0.04: 0.006209665 x 100 ≈ 0.621%

From the above observations, while keeping the mean constant at 8.1 oz, the
percentage of underfilled bottles decreases as we decrease the standard
deviation. We can therefore conclude that, when the mean is above 8 oz,
decreasing the standard deviation decreases the percentage of underfilled
bottles, since there is less variability and fewer values stray far below
the mean.
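
As a cross-check of these three probabilities, here is a short Python sketch (an assumption of this write-up, not the R commander output itself) that evaluates the normal CDF at 8 oz for μ = 8.1 and each σ.

```python
from statistics import NormalDist

mu = 8.1       # mean fill (oz), as in part (b)
cutoff = 8.0   # bottles below this amount count as underfilled

# Probability of an underfilled bottle for each standard deviation
prob_under = {sigma: NormalDist(mu, sigma).cdf(cutoff) for sigma in (0.1, 0.05, 0.04)}

for sigma, p in prob_under.items():
    z = (cutoff - mu) / sigma  # standardized cutoff
    print(f"sigma = {sigma}: z = {z:.2f}, P(X < 8) = {p:.3f} ({100 * p:.3f}%)")
```

Shrinking σ pushes the standardized cutoff further below zero (roughly -1, -2, and -2.5), which is why the underfilled percentage falls.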
c) Now set the standard deviation to 0.05 oz and change the mean. Enter the
value of µ as 8, then 8.05, and eventually 8.1 oz. Calculate the percentage of
underfilled bottles in each case. Describe briefly how the shape of the
corresponding curve changes. How does changing the value of µ affect the
filling process? Does the percentage of underfilled bottles increase or
decrease? Do not print the density curves.

Ans. Percentage when μ = 8 and σ = 0.05: 0.5 x 100 = 50%
Percentage when μ = 8.05 and σ = 0.05: 0.1586553 x 100 ≈ 15.866%
Percentage when μ = 8.1 and σ = 0.05: 0.02275013 x 100 ≈ 2.275%

The curve keeps the same shape because σ is unchanged; the distribution simply
shifts to the right as the mean increases. As a result, a larger percentage of
bottles receive at least the required 8 oz of cola. When the mean is increased,
the machine dispenses more cola on average, which reduces the percentage of
underfilled bottles.
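
The effect of shifting the mean can be sketched the same way. The Python snippet below is an illustrative cross-check (not the R commander output) that holds σ at 0.05 oz and moves μ through the three values from part (c).

```python
from statistics import NormalDist

sigma = 0.05   # standard deviation fixed at 0.05 oz, as in part (c)
cutoff = 8.0   # bottles below this amount count as underfilled

# Probability of an underfilled bottle for each setting of the mean
prob_by_mean = {mu: NormalDist(mu, sigma).cdf(cutoff) for mu in (8.0, 8.05, 8.1)}

for mu, p in prob_by_mean.items():
    print(f"mu = {mu}: P(X < 8) = {p:.3f} ({100 * p:.3f}%)")
```

Each increase of 0.05 oz in μ moves the cutoff one more standard deviation below the mean, so the underfilled share drops from 50% to about 15.9% and then to about 2.3%.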
2) Consider a random sample of 400 bottles obtained from the population of all bottles
filled by the machine over a specific short time period. The volume of cola in each
bottle is determined. The 400 observations recorded in the first column Volume are
available in the data file Lab2-Q2.txt on eClass. Given the very large sample size, we
may assume that the distribution of the volume of cola in bottles in the sample (data
file) is close enough to the population distribution while its mean and standard
deviation are close to the population parameters (μ and σ).
a) Obtain a frequency histogram of the 400 observations with the bins starting
at 8.07, ending at 8.18, and using a width of 0.01. (Hint: R assumes that the
right endpoint of each interval is included. Your histogram should include
the left endpoints.) Paste the histogram into your report. The format of the
histogram should be the same as the format of the histogram in Lab 1
Instructions (labels at the axes, title).
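
The hint about endpoints can be illustrated with a small counting sketch. The Python code below uses made-up volumes (not the Lab2-Q2.txt data) and assumes values are recorded to two decimals; it assigns each value to a bin that includes its left endpoint and excludes its right endpoint.

```python
from collections import Counter

# Hypothetical volumes in oz, recorded to two decimals (NOT the lab data)
volumes = [8.07, 8.09, 8.10, 8.10, 8.11, 8.11, 8.11, 8.13, 8.17]

start_hundredths = 807   # first bin starts at 8.07
n_bins = 11              # bins [8.07, 8.08), [8.08, 8.09), ..., [8.17, 8.18)

counts = Counter()
for v in volumes:
    h = round(v * 100)            # integer hundredths avoid floating-point edge issues
    idx = h - start_hundredths    # bin width is 0.01, so one bin per hundredth
    if 0 <= idx < n_bins:
        counts[idx] += 1

for idx in range(n_bins):
    left = (start_hundredths + idx) / 100
    print(f"[{left:.2f}, {left + 0.01:.2f}): {counts[idx]}")
```

With left-closed bins, a value of exactly 8.10 is counted in [8.10, 8.11) rather than in the bin below it. In R itself, the `right = FALSE` argument of `hist()` requests the same convention.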

b) Describe the shape of the histogram obtained in part (a). Does the histogram
support the claim of the company that the bottles are slightly overfilled?

Ans. The histogram is unimodal and asymmetric, with the mode in the interval
8.10 to 8.11. It is right skewed. Yes, the histogram in part (a) supports the
company's claim that the cola bottles are slightly overfilled, since every
interval in the histogram lies above 8.0 oz.
c) Obtain a Q-Q plot and a boxplot for the 400 observations. Add a title to each
plot. Paste both plots into your report. (TIP: Click “Options” and select
Outliers “(Interactively) with mouse” when you make the boxplot in R
commander to see to which observation the outlier(s) corresponds.) Is (are)
there any outlier(s)? Based upon the QQ-plot, does the distribution of volume
of cola in the bottles appear to be normal? What conclusions can be made
about the shape of the distribution from the Q-Q plot and boxplot? What
does the relationship between the whiskers tell us about the shape of the
distribution? Do the plots collectively confirm your findings in part (b) about
the shape of the distribution?
Q-Q plot:

The volumes do not appear to be normally distributed, as many of the points do
not lie on the normal reference line. Points that fall off the line indicate a
departure from normality rather than outliers. The pattern indicates a shape
that is right skewed and asymmetric.

Boxplot:

The boxplot is asymmetric, as the values are not equally distributed on either
side of the median. More specifically, the boxplot is right skewed, as the upper
whisker is longer than the lower whisker. Some of the outliers are observations
19, 25, 35, 55, 68, and 73.

Overall:

Neither plot suggests a normal distribution; both indicate a right-skewed,
asymmetric shape. Yes, both graphs agree with the findings in part (b) about the
shape of the distribution: right skewness.

d) Obtain the summary statistics (mean, standard deviation, IQR, min, Q1,
median, Q3, max, and n) of the 400 observations. Paste the summary
statistics into your report. Briefly describe the relationship between the mean
and median, as well as the relationship between the three quartiles. Are the
relationships consistent with the observed shape of the histogram in part (b)?
Ans.

From the summary statistics, we can see that the mean and median are quite
similar, with the mean (8.1111) slightly greater than the median (8.11).
The first quartile is 8.09, the second quartile (median) is 8.11, and the third
quartile is 8.13. The three quartiles are close to each other, 0.02 apart, while
the maximum lies 0.05 above the third quartile. Because the mean is slightly
larger than the median and the maximum lies well above the third quartile, we
conclude that the distribution is slightly skewed to the right, consistent with
the shape of the histogram in part (b).
Suppose that 200 packs are randomly selected, each consisting of 6 bottles of cola
obtained from the population of all bottles filled over a certain short time period.
The amount of cola in each bottle is determined. The measurements are saved in a
table consisting of 6 rows (sample size) and 200 columns (number of random
samples) that occupies the columns Sample1 – Sample200 in the lab2-Q3.txt file.

3. Obtain the mean amount of cola for each sample consisting of 6 bottles. Make
sure that all 200 columns are included in the panel of the “Numerical Summaries”
dialog box.

a) Obtain a frequency histogram of the 200 means with the bins starting at
8.08, ending at 8.15, and using a width of 0.005. (Hint: R assumes that the
right endpoint of each interval is included. Your histogram should include
the left endpoints.) Paste the histogram into your report. The format of the
histogram should be the same as the format of the histogram in Lab 1
Instructions (labels at the axes, title).
(b) Refer to the histogram obtained in part (a). Does the data appear to be normally
distributed? Compare the distribution of the means to the distribution of individual
observations studied in Question 2 in terms of their degree of skewness and spread.

The data in the histogram in part (a) appear to be approximately normally
distributed, though possibly slightly right skewed.

The histogram in Question 2 and the one in part (a) are both unimodal. The
histogram in part (a) looks closer to normal because it is less skewed. The
histogram in Question 2 is also more spread out: its range is about 0.11 oz,
compared with about 0.05 oz for the histogram in part (a).

(c) Obtain a Q-Q plot and a boxplot for the 200 means. Add a title to each plot.
Paste the plots into your report. Is (are) there any outlier(s)? Do the plots
collectively confirm your findings in part (b)? Compare the plots with the ones in
Question 2, part (c).
Several points in the Q-Q plot deviate from the line, although most of the data
lie fairly close to it. The boxplot shows 2 outliers: Sample 138 (8.145 oz) and
Sample 165 (8.140 oz).

Both the boxplot and the Q-Q plot indicate a slight skew to the right, which
confirms the findings in part (b).

(d) Obtain the sample size, mean, and standard deviation of the 200 means. Paste
the summaries into your report. Compare the values with the mean and the
standard deviation of the sampling distribution of the sample mean predicted by the
theory of sampling distributions. What does the standard deviation mean here?

According to the theory of sampling distributions, the mean of the sampling
distribution of the sample mean equals the population mean μ, and its standard
deviation is given by σ/sqrt(n), where n = sample size.
n = 6
The standard deviation of the population is found to be 0.02612463.
The predicted standard deviation of the sample mean is
0.02612463/sqrt(6) = 0.01066533 ≈ 0.0107.
This value is the theoretical standard deviation of the sample mean.
Comparing them, we see that the means are essentially equal, while the standard
deviation of the sample mean is much smaller than the standard deviation of the
individual bottles (0.02612463).
The standard deviation here shows how far the sample means typically fall from
the population mean, that is, how much the average fill of a 6-bottle pack
varies from pack to pack.
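
The arithmetic behind the predicted value can be sketched in a few lines of Python. The snippet below is an illustrative check only; it takes the population standard deviation quoted in this report as given.

```python
from math import sqrt

sigma = 0.02612463   # population standard deviation quoted in this report (oz)
n = 6                # bottles per pack

# Standard deviation of the sample mean predicted by sigma / sqrt(n)
sd_of_mean = sigma / sqrt(n)
shrink_factor = sd_of_mean / sigma   # equals 1 / sqrt(n)

print(f"sigma / sqrt({n}) = {sd_of_mean:.8f}")
print(f"ratio to population sigma = {shrink_factor:.3f}")
```

Averaging 6 bottles shrinks the spread to about 41% of the spread of individual bottles, which is why the histogram of pack means is narrower than the histogram in Question 2.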
Now suppose 200 boxes are randomly selected, each consisting of 30 bottles of cola obtained
from the population of all bottles filled over the same short time period. The amount of cola
in each bottle is determined. The measurements are saved in a table consisting of 30 rows
(sample size) and 200 columns (number of random samples) that occupies the columns
Sample1 – Sample200 in the lab2-Q4.txt file.

4. Obtain the mean amount of cola for each sample consisting of 30 observations. Make
sure that all 200 columns are included in the panel of the “Numerical Summaries” dialog
box.

(a) Obtain a frequency histogram of the 200 means with the bins starting at 8.09, ending at
8.13, and using a width of 0.003. Paste the histogram into your report. (Hint: R assumes
that the right endpoint of each interval is included. Your histogram should include the left
endpoints.) The format of the histogram should be the same as the format of the histogram
in Lab 1 Instructions (labels at the axes, title).
(b) Describe the shape of the histogram in part (a). Does the data appear to be
approximately normally distributed? Compare the histogram with the histogram obtained
in Question 2, part (a) and the one in Question 3, part (a). In particular, comment about
differences in degree of skewness and spread between each pair of graphs.

The histogram in part a) is unimodal and symmetric, with no apparent skew. The data appear to be
approximately normally distributed.

Compared with the histogram obtained in Question 2, part a), both histograms are unimodal.
However, Question 4's histogram is not skewed, unlike Question 2's, which is right skewed.
Additionally, both histograms indicate that the bottles are slightly overfilled with cola: the
Question 2 histogram covers a range of 8.06 to 8.18, and the Question 4 histogram covers a
range of approximately 8.095 to 8.125. The histogram of the volume of all observations and the
histogram of the means of 200 samples of 30 bottles are therefore centered at about the same
value, but the histogram of means is much less spread out.

Comparing the histograms in Question 3.a) and 4.a), we see that both histograms are unimodal.
However, the histogram for 4.a) is not skewed, unlike 3.a), which is slightly right skewed. The
histogram of the means of 200 samples of 30 bottles has a smaller range of values (~8.095 to
8.125) than the histogram of the means of 200 packs of 6 bottles, which has a range of 8.08 to
8.15. This indicates that the means in 3.a) have a higher standard deviation than the means in
4.a), since more of them stray from the overall mean.

(c) Obtain a Q-Q plot and a boxplot for the 200 means. Add a title to each plot. Paste the
plots into your report. Is (are) there any outlier(s)? Does it appear that the sample means
come from a normal distribution? Explain. Do the plots collectively confirm your findings
in part (b)? Compare the plots with the plots obtained in part (c) of Questions 2 and 3.
What do you conclude?

No, there aren't any outliers in the boxplot, and no points fall far from the line in the Q-Q plot.
Yes, it appears that the sample means come from a normal distribution, as the points on the Q-Q plot
follow a linear trend close to the reference line. This supports our finding in Q4(b), where the
histogram appeared normally distributed. Much less skewness is seen here than in Questions 2 and 3.

Comparing the plots, the Q-Q plot in Q4 indicates a normal distribution, while the plot in Q2 is clearly
right skewed and the plot in Q3 is slightly skewed to the right. This is likely due to the increase in
sample size. So, we can conclude that as the sample size n increases (the number of samples stays at
200), the skewness of the distribution of the sample mean decreases.
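
This pattern can be illustrated with a small simulation. The Python sketch below does not use the lab data: it assumes a hypothetical right-skewed population (exponential with mean 1) and uses 5000 repetitions instead of the lab's 200 so that the estimates are stable. The helper names `skewness` and `sample_means` are made up for this example.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

def skewness(xs):
    """Moment-based skewness: near 0 for symmetric data, positive for right skew."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 3 for x in xs])

def sample_means(n, reps, rng):
    """Means of `reps` random samples of size n from an exponential(1) population."""
    return [fmean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]

rng = random.Random(151)   # fixed seed so the run is repeatable
reps = 5000                # assumption: more repetitions than the lab's 200

means6 = sample_means(6, reps, rng)
means30 = sample_means(30, reps, rng)

skew6, skew30 = skewness(means6), skewness(means30)
sd6, sd30 = pstdev(means6), pstdev(means30)

print(f"n = 6:  skewness = {skew6:.2f}, sd = {sd6:.3f}, theory sd = {1 / sqrt(6):.3f}")
print(f"n = 30: skewness = {skew30:.2f}, sd = {sd30:.3f}, theory sd = {1 / sqrt(30):.3f}")
```

Under these assumptions, the means of samples of size 30 are both less skewed and less spread out than the means of samples of size 6, in line with the comparison of Questions 3 and 4.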
(d) Use the Summary Statistics (Columns) feature to obtain the sample size, mean, and
standard deviation of the 200 means. Paste the summaries into your report. Compare the
value of the standard deviation of the sample mean for n = 30 with the standard deviation
of the sample mean in Question 3, part (d) (for n = 6). Compare the values with the mean
and the standard deviation of the sampling distribution of the sample mean predicted by
the theory of sampling distributions. Which sample mean tends to be a more accurate
estimate of the population mean?

According to the theory of sampling distributions, the standard deviation of the sampling
distribution of the sample mean is given by σ/sqrt(n), where n = sample size.
n = 30
The standard deviation of the population is equal to 0.02612463.
Hence the predicted standard deviation of the sample mean is
0.02612463/sqrt(30) = 0.00476968 ≈ 0.00477.
This value is the theoretical standard deviation of the sample mean.

The standard deviation of sample mean in Q3 (d) is 0.0107 and the standard deviation of sample mean
in Q4 (d) is 0.0048.

For Q3, the mean of the sample means is 8.111008, and the standard deviation predicted by the
theory of sampling distributions is 0.0107.

For Q4, the mean of the sample means is 8.111145, and the standard deviation predicted by the
theory of sampling distributions is 0.0048.

Because the sample size is greater in Q4, its sample mean is a more accurate estimate of the
population mean: the error for Q4 is 8.1111 - 8.111145 = -0.000045, while for Q3 it is
8.1111 - 8.111008 = 0.000092.

Hence the sample mean in Q4 is the more accurate estimate: as the sample size increases, the
error tends to decrease.
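
The comparison of the two sample sizes can be reproduced with a short calculation. The Python sketch below is illustrative only; it uses the population mean, population standard deviation, and means of the sample means quoted in this report.

```python
from math import sqrt

sigma = 0.02612463                            # population standard deviation (oz)
population_mean = 8.1111                      # mean of the 400 observations (oz)
mean_of_means = {6: 8.111008, 30: 8.111145}   # values quoted in Q3 and Q4

predicted_sd = {}
error = {}
for n, m in mean_of_means.items():
    predicted_sd[n] = sigma / sqrt(n)    # theoretical sd of the sample mean
    error[n] = population_mean - m       # error as defined in this report
    print(f"n = {n}: predicted sd = {predicted_sd[n]:.5f}, error = {error[n]:.6f}")

# How much tighter the sampling distribution is for n = 30 than for n = 6
sd_ratio = predicted_sd[6] / predicted_sd[30]   # equals sqrt(30 / 6)
print(f"sd ratio (n = 6 vs n = 30) = {sd_ratio:.3f}")
```

Going from 6 to 30 bottles per sample cuts the predicted standard deviation by a factor of sqrt(5), about 2.24, which is consistent with the smaller error observed for n = 30.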
