Estimation of the Population Average


Mindanao State University

General Santos City

December 28, 2020

Carlito O. Daarol
Instructor
Descriptive Statistics and Statistical Inference

Note: This is your reference for question number 1 in the final exam

Estimation of the population average


Case 1: The population is given with parameter µ.

Example 1: The population consists of 47,558 students whose average body mass index
(BMI) is µbmi = 19.32374 (which is considered a healthy status)

Using random sampling, repeated random sampling and Central Limit Theorem

Actual value of µbmi is 19.32374


A point estimate of µbmi is 19.32907
And a 95% Confidence interval for µbmi is the interval (19.28632, 19.3717)
(Complete R code is needed to get these results.)
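The procedure can be sketched in R as follows. This is a minimal illustration, not the complete course code: the population vector bmi is simulated here (with an assumed standard deviation of 2), since the actual 47,558 BMI values are not reproduced in this handout, so the printed numbers will not match 19.32907 or (19.28632, 19.3717) exactly; only the structure of the computation matches the procedure described above.

```r
set.seed(123)

# Placeholder population: the actual 47,558 BMI values are not reproduced
# in this handout, so a population with the stated mean is simulated here
# purely for illustration (the sd of 2 is an assumption).
bmi <- rnorm(47558, mean = 19.32374, sd = 2)

# Repeated random sampling: draw B samples of size n and record each mean
B <- 10000
n <- 100
sample_means <- replicate(B, mean(sample(bmi, n)))

point_estimate <- mean(sample_means)          # point estimate of mu_bmi
ci <- quantile(sample_means, c(0.025, 0.975)) # 95% confidence interval

point_estimate
ci
```

The same skeleton works for Example 2 by replacing the BMI vector with the vector of ages.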

Example 2: The same population, whose average age is µage = 26.49931 years old
Using random sampling, repeated random sampling and Central Limit Theorem

Actual value of µage is 26.49931


A point estimate of µage is 26.48184 years old
And a 95% Confidence interval for µage is the interval (26.3384, 26.6282)
(Complete R code is needed to get these results.)

Case 2: The population is completely unknown with parameter µ. We are only given a set of random
samples.

Random samples = {28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40 , 21, 31, 34, -2, 25, 19}

Using random sampling, repeated random sampling and Central Limit Theorem

Actual mean value of µ: NA (Unknown Population)


Point Estimate for the population mean µ is 21.74553
95% Confidence Interval for the mean µ is (15.60, 30.30)
(Complete bootstrapping code is needed to get these results)

Remark: The variable being measured is the speed of light. The data were collected
sometime in the late 19th century.

How is it done? (The bootstrap process)

Step 1: Get a random sample of size n

RandomSample = {28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40 , 21, 31, 34, -2, 25, 19}

Step 2: Draw a random sample of size n from the original set of random samples (with replacement). This
process is called resampling; repeat it many times (say, 20,000 times). Doing so
generates the following sets of random samples:

Set1: 21 25 40 29 16 -2 24 26 19 40 26 25 33 29 23 22 16 33 25 30
Set2: 40 34 24 22 29 21 16 24 22 33 33 16 22 27 -44 26 23 29 24 -2
Set3: 28 27 21 33 21 31 19 27 24 23 22 31 34 -2 34 -44 30 24 26
Set4: 19 40 29 23 31 29 40 29 40 22 29 21 26 23 25 16 -2 16 29 -44
Set5: 27 31 24 30 29 40 25 22 33 22 -44 31 24 25 19 21 22 30 28 2321


Set20000: 31 21 -44 25 21 31 34 24 27 25 19 34 28 19 26 23 34 26 27 16

The sets Set1, Set2, Set3, …, Set20000 are called bootstrap random samples; they are obtained by
repeated resampling using the original set as the source.

Step 3: For each bootstrap random sample, compute the mean value. This gives us a sequence of
mean values:

mean1 = mean of Set1 = 25.00


mean2 = mean of Set2 = 20.95
mean3 = mean of Set3 = 21.70
mean4 = mean of Set4 = 22.05
mean5 = mean of Set5 = 27.40
mean6 = mean of Set6 = 23.10


Mean20000 = mean of Set20000 = 22.35

The collection of sample means (25.00, 20.95, 21.70, 22.05, 27.40, 23.10, …, 20.85, 22.35) forms
what is called a distribution of sampling means. The mean of these sampling means is the
estimate of the true population mean, and this result is guaranteed by the Central Limit
Theorem.
R commands to generate Set1 and mean1 (with the original sample stored in random_samples):
random_samples <- c(28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40, 21, 31, 34, -2, 25, 19)
Set1 <- sample(random_samples, length(random_samples), replace = TRUE)
Set1
mean(Set1)

R for loop to repeat the resampling B = 20,000 times

set.seed(123) # so that your results coincide exactly with mine

Set <- list()
Means <- NULL
random_samples <- c(28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40, 21, 31, 34, -2, 25, 19)

for (i in 1:20000){
Set[[i]] <- sample(random_samples, length(random_samples), replace = TRUE)
Means[i] <- mean(Set[[i]]) # compute the mean of the resampled set
}

The for loop generates


1) a collection of resampled sets Set[[i]], where i = 1, 2, 3, …, B (with B = 20,000)
2) a collection of sampling means (one sample mean computed for each Set[[i]])

Lastly, the mean of the sampling means is the estimate of the population parameter.
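Putting the pieces together, a self-contained sketch of the whole bootstrap computation is shown below. The percentile method (taking the 2.5% and 97.5% quantiles of the bootstrap means) is one standard way to form the 95% interval; the exact numbers depend on the seed and R version, so they may differ slightly from those quoted earlier.

```r
set.seed(123)

# Original set of random samples (the 20 observations given above)
random_samples <- c(28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29,
                    24, 40, 21, 31, 34, -2, 25, 19)

# Resample B times, with replacement, recording the mean of each resample
B <- 20000
Means <- numeric(B)
for (i in 1:B) {
  resample <- sample(random_samples, length(random_samples), replace = TRUE)
  Means[i] <- mean(resample)
}

point_estimate <- mean(Means)          # estimate of the population mean mu
ci <- quantile(Means, c(0.025, 0.975)) # 95% percentile bootstrap interval

point_estimate
ci
```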

Bootstrap process in one picture (figure not reproduced here)

The normal curve on the right side represents the distribution of the sampling means.
The sample means are computed from the bootstrap samples.
The mean of the sampling means is the estimate of the true unknown population parameter θ.
Step 4: Invoke the Central Limit Theorem

General Idea: Regardless of the population distribution model, as the sample size
increases, the sample mean tends to be normally distributed around the population mean, and
its standard deviation shrinks as n increases.

In symbols, this means X̄ ≈ N(µ, σ²/n): the sample mean is approximately normally distributed around the population mean µ, with standard deviation σ/√n.

Application of Central Limit Theorem

CLT can be applied in two ways


1. From a collection of sets of random samples, you can estimate the population parameters.
An example is the estimation of population parameters as discussed above

2. From the given population parameters, you can derive the sample properties

Example for the second application of CLT:

It is believed that nearsightedness affects about 8% of all children. 194 incoming children have their
eyesight tested. Can the CLT be used in this situation?
Answer is yes!

1. Nearsightedness is a two-level variable (so the count of nearsighted children is a binomial random variable): either
you are nearsighted or you are not.

2. For a binomial random variable X, the mean of X is equal to Np and the variance of X is Npq, where q = 1 − p.

3. The proportion of nearsightedness is p = 8% = 0.08 and the number of children tested is N = 194. This means that,
among the 194 children, the expected number of nearsighted students is 0.08(194) = 15.52 ≈ 16.

By the CLT, the count X should be approximately normally distributed with a mean of Np = 194(0.08) = 15.52
and standard deviation = √variance = √(Npq) = √(194 × 0.08 × 0.92) ≈ 3.7787.

4. With 194 incoming children, what is a reasonable range of nearsighted children the school can
expect?

We are going to use the 68-95-99.7 normality rule here. A reasonable estimate is to cover
99.7% of the data, which falls within 3 standard deviations of the mean.

3 standard deviations is 3SD = 3(3.7787) = 11.3361.


This within-3SD interval is (15.52 − 11.3361, 15.52 + 11.3361) = (4.1839, 26.8561).
The school should expect between 4 and 27 nearsighted children out of 194 incoming children.
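The arithmetic in this example can be checked with a few lines of R:

```r
N <- 194      # number of incoming children screened
p <- 0.08     # assumed prevalence of nearsightedness
q <- 1 - p

mu <- N * p              # expected number of nearsighted children
sigma <- sqrt(N * p * q) # standard deviation of the binomial count

lower <- mu - 3 * sigma  # 99.7% range via the 68-95-99.7 rule
upper <- mu + 3 * sigma

round(c(mu = mu, sigma = sigma, lower = lower, upper = upper), 4)
```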

Example of Central Limit Theorem in Accounting

Central Limit Theorem - Overview, History, and Example (corporatefinanceinstitute.com)

An investor is interested in estimating the return of ABC stock market index


that is comprised of 100,000 stocks. Due to the large size of the index, the
investor is unable to analyze each stock independently and instead chooses to
use random sampling to get an estimate of the overall return of the index.

The investor picks random samples of the stocks, with each sample comprising
at least 30 stocks. The samples must be random, and previously selected stocks must be
returned to the pool before the next draw (sampling with replacement) to avoid bias.

If the first sample produces an average return of 7.5%, the next sample may
produce an average return of 7.8%. With the nature of randomized sampling,
each sample will produce a different result. As you pick more and more
samples, the sample means will start forming their own distribution.

The distribution of the sample means will move toward normal as the value of
n increases. The average return of the stocks in the sample index estimates the
return of the whole index of 100,000 stocks, and the average return is normally
distributed.

History of the Central Limit Theorem

The initial version of the central limit theorem was developed by Abraham De
Moivre, a French-born mathematician. In an article published in 1733, De
Moivre used the normal distribution to find the number of heads resulting
from multiple tosses of a coin. The concept was unpopular at the time, and it
was forgotten quickly.

However, in 1812, the concept was reintroduced by Pierre-Simon Laplace,


another famous French mathematician. Laplace re-introduced the normal
distribution concept in his work titled “Théorie Analytique des Probabilités,”
where he attempted to approximate binomial distribution with the normal
distribution.

The mathematician found that the average of independent random variables,
when increased in number, tends to follow a normal distribution. At that time,
Laplace’s findings on the central limit theorem attracted attention from other
theorists and academicians.

Later in 1901, the central limit theorem was expanded by Aleksandr Lyapunov,
a Russian mathematician. Lyapunov went a step ahead to define the concept
in general terms and prove how the concept worked mathematically. The
characteristic functions that he used to provide the theorem were adopted in
modern probability theory.
What is the Central Limit Theorem (CLT)?
The Central Limit Theorem (CLT) is a statistical concept that states that the
sample mean distribution of a random variable will assume a near-normal or
normal distribution if the sample size is large enough. In simple terms, the
theorem states that the sampling distribution of the mean approaches a
normal distribution as the size of the sample increases, regardless of the shape
of the original population distribution.

As the user increases the number of samples to 30, 40, 50, etc., the graph of
the sample means will move towards a normal distribution. A common rule of
thumb is that a sample size of 30 or more is large enough for the approximation to hold.

One of the most important components of the theorem is that the mean of
the sampling distribution equals the mean of the entire population. If you
calculate the means of multiple samples of the population, add them up, and
find their average, the result will be an estimate of the population mean.
The spread behaves differently, however: the standard deviation of the sample
means (the standard error) is not the population standard deviation itself but
the population standard deviation divided by √n, so it shrinks as the sample size grows.

How Does the Central Limit Theorem Work?

The central limit theorem underpins much of statistical inference. It
makes it easy to understand how population estimates behave when
subjected to repeated sampling. When plotted on a graph, the theorem shows
the shape of the distribution formed by the means of repeated population
samples.

As the sample sizes get bigger, the distribution of the means from the
repeated samples tends toward a normal distribution, regardless of the
original shape of the distribution. This can be illustrated with a figure
(not reproduced here) of a uniform parent distribution being repeatedly
sampled: even though the original distribution is uniform, the distribution
of the sample means tends toward a normal distribution as the value of n
(sample size) increases.

Apart from showing the shape that the sample means will take, the central
limit theorem also describes the mean and variance of the distribution. The
mean of the sampling distribution is the actual population mean from which
the samples were taken.

The variance of the sample distribution, on the other hand, is the variance of
the population divided by n. Therefore, the larger the sample size of the
distribution, the smaller the variance of the sample mean.
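This shrinking of the variance is easy to verify by simulation. The uniform population below is only a placeholder, chosen deliberately to be non-normal so the check also illustrates that the result does not depend on the population's shape:

```r
set.seed(123)

# Placeholder population: deliberately non-normal (uniform on [0, 10])
population <- runif(100000, min = 0, max = 10)
sigma2 <- var(population)  # (approximate) population variance

# Draw many samples of size n and record each sample mean
n <- 50
sample_means <- replicate(20000, mean(sample(population, n)))

var(sample_means)  # close to sigma2 / n
sigma2 / n
```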

You might also like