
Chapter 14

Statistics

Imagine that 500,000 bolts are produced every day in a factory. The bolts must satisfy
certain quality standards. Ideally, one should check every bolt before it is used. However,
it would be too expensive to test each one of them, so we have to choose a few bolts (say
50) at random. The smaller set of bolts (chosen for testing) is known as the sample, which
is selected from the larger set of 500,000, known as the population.

14.1 Data analysis: numerical summary of data


As mentioned above, it is practically impossible to compute the mean µ and variance σ² of
a population exactly. However, we can find the mean of a sample {x1, x2, ···, xn},

x̄ = (1/n)(x1 + x2 + ··· + xn) = (1/n) Σ_{i=1}^{n} xi,   (14.1)

where n is the sample size. Similarly, we can find the variance of a sample as,

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)².   (14.2)

Note that the sample mean or sample variance need not be equal to the population mean or
population variance. Similarly, the k-th moment of a sample is,

x̄k = (1/n) Σ_{i=1}^{n} xi^k.   (14.3)

Note that the first moment is equal to the mean and the second moment is related to the
variance.
There are a few special points which are very useful for statistical analysis. For example,
the median is the point that divides the data into two equal parts, half below the median
and half above. The point below which 25% of the data points of a sample lie is known as
the first quartile or lower quartile (q1). The point below which 50% of the data points lie
is the second quartile (q2), which is the same as the median. The point below which 75% of
the data points lie is the third quartile (q3). The minimum and maximum values define the
sample range. The interquartile range (IQR) is defined as q3 − q1. A data set is given in
Table 14.1 and the corresponding numerical summary is given in Table 14.2.


105  97 135 163 207 134 218 199 170 176
183 153 174 154 190  70 101 112 149 200
121 120 181 160 194 144 165 145 160 150
180 167 176 158 156 229 158 148 115  88

Table 14.1: Population: 40,000 pieces of a certain alloy are produced every day in a fac-
tory. Their strength must meet a certain criterion. It is practically impossible to test each
of the 40,000 pieces, so one selects 40 pieces at random and tests them. Sample: 40 pieces
of the alloy, randomly selected from the population of 40,000 pieces. The measured strength
values are listed in the table, and a numerical summary of the data is given in Table 14.2.

N x̄ s Min q1 Median (q2 ) q3 Max


40 155.125 35.91 70 134.75 158 177 229

Table 14.2: Numerical summary of the data provided in Table 14.1. Python code to generate
the data summary is given in Figure 14.1. Provided the data set follows a symmetric
probability distribution (like the normal distribution), the mean and median are nearly
equal. In the case of a skewed distribution, the mean and median differ from one another
(see Exercise).
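The Python code referred to in Figure 14.1 is not reproduced here; a minimal sketch of how such a summary can be generated with pandas (the variable names `data` and `df` are my own choice, not from the text) is:

```python
import pandas as pd

# Strength values from Table 14.1
data = [105, 97, 135, 163, 207, 134, 218, 199, 170, 176,
        183, 153, 174, 154, 190, 70, 101, 112, 149, 200,
        121, 120, 181, 160, 194, 144, 165, 145, 160, 150,
        180, 167, 176, 158, 156, 229, 158, 148, 115, 88]

df = pd.DataFrame({"strength": data})
# count, mean, std, min, 25%, 50%, 75%, max -- compare with Table 14.2
print(df.describe())
```

The `df.describe()` call, mentioned again in the caption of Figure 14.2, produces exactly the eight quantities listed in Table 14.2.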

14.2 Data analysis: graphical representation of data


Graphical representation is a more powerful way of analyzing and visualizing data than a
numerical summary. Several methods exist, and a few selected ones are discussed below.

14.2.1 Stem-and-leaf diagram


This method does not even need any software; you can do it with just pen and paper.
Our discussion is based on the data given in Table 14.1. Each number is divided into
two parts: a stem, consisting of the leading digit(s), and a leaf, consisting of the
remaining digit. The diagram is shown in Table 14.3. The digit in the ones place is taken
as the leaf and the remaining digits (tens and/or hundreds place) are taken as the stem.
The last column of the diagram displays the leaf (frequency) count associated with each
stem. Some key features of the data become clear from the stem-and-leaf diagram that were
not obvious from Table 14.1 or Table 14.2. For example, we can see that almost 50% of the
data points lie in the range 140 to 180. While the diagram is easy to construct, the output
is not very appealing visually, and it becomes increasingly tedious as the number of data
points in the sample grows.
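As a sketch, the construction described above can also be automated in a few lines of Python (the leaves are printed in sorted order here, unlike Table 14.3):

```python
from collections import defaultdict

# Strength values from Table 14.1
data = [105, 97, 135, 163, 207, 134, 218, 199, 170, 176,
        183, 153, 174, 154, 190, 70, 101, 112, 149, 200,
        121, 120, 181, 160, 194, 144, 165, 145, 160, 150,
        180, 167, 176, 158, 156, 229, 158, 148, 115, 88]

# Stem = all digits except the last, leaf = ones digit
stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)

# One row per stem, with the leaf count in the last column (cf. Table 14.3)
for stem in sorted(stems):
    leaves = stems[stem]
    print(f"{stem:3d} | {''.join(str(d) for d in leaves):<8s} {len(leaves)}")
```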

208
14.2. DATA ANALYSIS: GRAPHICAL REPRESENTATION OF DATA

Stem Leaf Frequency


7 0 1
8 8 1
9 7 1
10 51 2
11 25 2
12 10 2
13 54 2
14 4589 4
15 348680 6
16 73050 5
17 4606 4
18 301 3
19 049 3
20 70 2
21 8 1
22 9 1

Table 14.3: Stem-and-leaf diagram of the data given in Table 14.1.

Figure 14.1: Data summary, histogram and cumulative distribution plot for the data given
in Table 14.1. The mode is the value appearing most often in a data set (here somewhere
between 150 and 170). You can verify this from the stem-and-leaf diagram as well
(Table 14.3). Provided the data set follows a symmetric distribution (like the normal
distribution), the mean, median and mode are nearly equal. In the case of a skewed
distribution, the mean, median and mode differ from one another (see Exercise).

209
14.2. DATA ANALYSIS: GRAPHICAL REPRESENTATION OF DATA

Figure 14.2: Box and whisker plot for the data given in Table 14.1. The box edges are at
q1 = 134.75, q2 = 158 (median) and q3 = 177, so IQR = q3 − q1 = 42.25. The upper whisker is
at 229, the largest data point within q3 + 1.5 IQR = 240.375; the lower whisker is at 88,
the smallest data point within q1 − 1.5 IQR = 71.375. The point 70 lies below the lower
whisker and is an outlier. Note that df.describe() generates the summary reported in
Table 14.2.

14.2.2 Histogram
A histogram of the data given in Table 14.1 is shown in Figure 14.1. A histogram is a
frequency distribution of the data: the range of the data is divided into a certain number
of intervals, known as bins, and generally bins of equal width are used.1 In this particular
problem, 8 bins of width 20 are taken, i.e., 70-90, 90-110, 110-130, 130-150,
150-170, 170-190, 190-210, 210-230. There are several ways to represent a histogram.
Two different representations are shown in Figure 14.1: the number of observations in each
bin, and a normalized plot where the total area under the histogram is equal to 1.
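The binning described above can be sketched with numpy (the bin edges 70, 90, ..., 230 are the ones given in the text):

```python
import numpy as np

# Strength values from Table 14.1
data = [105, 97, 135, 163, 207, 134, 218, 199, 170, 176,
        183, 153, 174, 154, 190, 70, 101, 112, 149, 200,
        121, 120, 181, 160, 194, 144, 165, 145, 160, 150,
        180, 167, 176, 158, 156, 229, 158, 148, 115, 88]

# 8 bins of width 20 covering 70-230
bins = np.arange(70, 231, 20)
counts, edges = np.histogram(data, bins=bins)
density, _ = np.histogram(data, bins=bins, density=True)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:3.0f}-{hi:3.0f}: {c}")

# Normalized representation: total area under the histogram equals 1
print((density * np.diff(edges)).sum())
```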

14.2.3 Box and whisker plot


A box and whisker plot is a visual representation of Table 14.2. The plot is shown in Fig-
ure 14.2. The box encloses the IQR, with its lower edge at the first quartile (q1) and upper
edge at the third quartile (q3). A line is drawn through the box at the second quartile (q2),
the median. A whisker starts from each end of the box. The lower whisker starts at the
lower end of the box (q1) and extends down to the smallest data point not below
q1 − 1.5 IQR. The upper whisker starts at the upper end of the box (q3) and extends up to
the largest data point not above q3 + 1.5 IQR. Data points lying beyond the whiskers are
outliers.
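A sketch of the whisker and outlier computation for the Table 14.1 data, reproducing the values annotated in Figure 14.2:

```python
import numpy as np

# Strength values from Table 14.1
data = [105, 97, 135, 163, 207, 134, 218, 199, 170, 176,
        183, 153, 174, 154, 190, 70, 101, 112, 149, 200,
        121, 120, 181, 160, 194, 144, 165, 145, 160, 150,
        180, 167, 176, 158, 156, 229, 158, 148, 115, 88]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

arr = np.array(data)
lower_whisker = arr[arr >= lower_fence].min()  # smallest point within the fence
upper_whisker = arr[arr <= upper_fence].max()  # largest point within the fence
outliers = arr[(arr < lower_fence) | (arr > upper_fence)]

# Compare with the values annotated in Figure 14.2
print(q1, q2, q3, lower_whisker, upper_whisker, outliers)
```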

14.2.4 Scatter diagram


Scatter diagrams are generally used to plot multivariate data. They are an excellent way to
identify a potential relationship between two variables. When more than two variables exist,
a matrix of scatter diagrams is used to look at all the pairwise relationships. The
correlation coefficient is defined as,2

rxy = Σ_{i=1}^{n} (yi − ȳ)(xi − x̄) / √[ Σ_{i=1}^{n} (yi − ȳ)² Σ_{i=1}^{n} (xi − x̄)² ].   (14.4)
1 Although there is no general rule regarding the number of bins, the square root of the number of data points is regarded as a good guess.
2 Compare with the covariance and correlation, as discussed in the probability chapter.


Figure 14.3: (Top row) Scatter plot matrix. Correlation coefficients are calculated using
Eq 14.4 and illustrated in a correlation heat map (bottom row). Since x1 and x2 are two
randomly generated variables, there is no correlation between them. On the other hand,
x3 and x4 are deliberately defined in such a way that they are linearly related to x1 and x2 ,
respectively.

While rxy = 0 implies no linear relationship, rxy = 1 and rxy = −1 indicate a perfect linear
relationship with positive and negative slope, respectively. Generally, 0.5 < |rxy| < 0.8
indicates a weak linear dependence, while |rxy| > 0.8 indicates a strong linear dependence.
An example is given in Figure 14.3.
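A sketch of Eq. 14.4 in Python, checked against numpy's built-in `np.corrcoef`; unlike the variables of Figure 14.3, a simple linear relation x3 = 2 x1 + noise is used here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 2500)            # two independent random variables
x2 = rng.normal(0, 1, 2500)
x3 = 2 * x1 + rng.normal(0, 0.2, 2500)  # strongly linearly related to x1

def corr(x, y):
    """Correlation coefficient rxy, written out as in Eq. 14.4."""
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

print(corr(x1, x2))               # near 0: no linear relationship
print(corr(x1, x3))               # near 1: strong positive linear relationship
print(np.corrcoef(x1, x3)[0, 1])  # agrees with corr()
```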

14.2.5 Probability plot


Most of the time we just have the data; we do not know the probability distribution and
have to find it out. In such a situation we assume that the data follows some probability
distribution, and probability plots help us verify whether the assumed or hypothesized
probability distribution is a reasonable model for the sample data.
Our discussion will focus on the normal probability plot. To construct a normal proba-
bility plot, the observations are ranked in ascending order, as shown in Table 14.4. For
a given data point xi, we calculate the standardized normal score zi using the following
relations,

(i − 0.5)/n = P(Z ≤ zi) and P(Z ≤ zi) = Φ(zi).   (14.5)

The first equation gives the cumulative probability and the second equation gives the value
of zi from the cumulative standard normal distribution table (see Probability chapter, Fig-
ure 13.16). Finally, the normal probability plot is drawn by placing xi and zi along the
vertical and horizontal axes, respectively (Figure 14.4). The plotted points will fall
approximately along a straight line if the data is adequately described by a normal
distribution, as hypothesized.
Let us understand the origin of the straight line in the normal probability plot. If we define
our new variable as,

Z = (X − x̄)/s,   (14.6)

then the X vs. Z plot has to be a straight line by design. The probability distribution of X does not


i xi (i − 0.5)/10 = Φ(zi ) zi
1 176 0.05 -1.64
2 180 0.15 -1.04
3 187 0.25 -0.67
4 190 0.35 -0.39
5 193 0.45 -0.13
6 196 0.55 0.13
7 201 0.65 0.39
8 205 0.75 0.67
9 211 0.85 1.04
10 220 0.95 1.64

Table 14.4: Constructing a normal probability plot from a given set of data. Value of zi is
obtained from the cumulative standard normal distribution table (see Probability chapter,
Figure 13.16).

Figure 14.4: Top row: normal probability plot of the data provided in Table 14.4, (cen-
ter) plotted using the (xi, zi) data given in the table and (right) plotted using scipy. Bottom
row: (center) histogram and (right) normal probability plot of 500 randomly generated data
points, drawn from a normal distribution. In conclusion, if the data set follows a normal
probability distribution, the data points lie along a straight line in the normal probability
plot. Can you guess the mean and variance just by looking at the probability plot?


Figure 14.5: Normal probability plot of non-normal distributions: (left) heavier-tailed [than
normal] and (right) skewed distribution.

matter in this case. In fact, this is the straight line you see in the normal probability plot.
Now, X vs. Z plotted using the Z values obtained from Eq. 14.5 falls on this straight line
[Eq. 14.6], provided X follows a normal distribution. Otherwise, if the data is not described
by a normal distribution, we observe some deviation from the straight line [see
Figure 14.5 and Exercise].
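The construction of Table 14.4 can be sketched directly with scipy's inverse of the cumulative standard normal distribution, `stats.norm.ppf`:

```python
import numpy as np
from scipy import stats

# Data from Table 14.4, already in ascending order
x = np.array([176, 180, 187, 190, 193, 196, 201, 205, 211, 220])
n = len(x)

# Eq. 14.5: cumulative probability (i - 0.5)/n, then invert Phi to get z_i
i = np.arange(1, n + 1)
p = (i - 0.5) / n
z = stats.norm.ppf(p)   # z_i = Phi^{-1}((i - 0.5)/n)

# Compare with the z_i column of Table 14.4
for xi, zi in zip(x, z):
    print(f"{xi}  {zi:+.2f}")
```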

14.2.6 Exercise
1. (py) Symmetric data: Normal distribution is an example of a symmetric distribution.
In this case, mean, median and mode are equal. You can use np.random.normal(µ, σ, n)
to create a random normal data set of size n. Take n = 50 (sample) and n = 5000
(population) and do the following.

• Use µ = 0, σ = 1. Generate the summary of the data set, plot the histogram and
box-and-whisker plot. Verify whether mean, median and mode are nearly equal
or not.
• Use µ = 0, σ = 2. Generate the summary of the data set, plot the histogram and
box-and-whisker plot. Verify whether mean, median and mode are nearly equal
or not.

2. (py) Symmetric vs. skewed data: Beta distribution is given by,

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1),  for 0 < x < 1.   (14.7)

When α = β, we get a symmetric distribution. When α < β, we get a right-skewed dis-
tribution. When α > β, we get a left-skewed distribution. You can use np.random.beta(α, β, n)
to create a random beta data set of size n. Take n = 50 (sample) and n = 5000 (popula-
tion) and do the following.

• Use α = 1, β = 5 to generate a right-skewed data set. Generate the summary of
the data set, plot the histogram and box-and-whisker plot.
• Use α = 5, β = 5 to generate a symmetric data set. Generate the summary of the
data set, plot the histogram and box-and-whisker plot.
• Use α = 5, β = 1 to generate a left-skewed data set. Generate the summary of the
data set, plot the histogram and box-and-whisker plot.
• Although the histogram is the easiest way to identify symmetric, left-skewed and
right-skewed data sets, also try to distinguish them based on the summary of the
data set and the box-and-whisker plot.

3. (py) Right-skewed distribution: Gamma distribution is given by,

f(x) = [β^α/Γ(α)] x^(α−1) e^(−βx),  for x > 0, α, β > 0.   (14.8)

213
14.2. DATA ANALYSIS: GRAPHICAL REPRESENTATION OF DATA

You can use np.random.gamma(α, 1/β, n) to create a random gamma data set of size n
(note that numpy's second argument is the scale θ = 1/β, not the rate β). Take n = 50
(sample) and n = 5000 (population) and do the following. Is the distribution left skewed
or right skewed? Generate the summary of the data set, plot the histogram and
box-and-whisker plot for different α, β values.

4. (py) Right-skewed distribution: Exponential distribution is given by,

f(x) = α e^(−αx) for x > 0, and f(x) = 0 for x ≤ 0.   (14.9)

You can use np.random.exponential(1/α, n) to create a random exponential data set of
size n (note that numpy's argument is the scale 1/α, not the rate α). Take n = 50 (sample)
and n = 5000 (population) and do the following. Is the distribution left skewed or right
skewed? Generate the summary of the data set, plot the histogram and box-and-whisker
plot for different α values.

5. (py) Compare the cumulative histogram plots of a symmetric, left-skewed and right-
skewed data set.

6. (py) Understanding the normal probability plot: In this problem we shall learn to draw
the normal probability plot from scratch, completely by ourselves, and also compare with
the plot generated by scipy.

• Generate 500 normal random variables with µ = 10 and σ = 2 and arrange them
in ascending order. Calculate Z by two different routes: first from Eq. 14.5 [hint:
use stats.norm.ppf()] and then from Eq. 14.6. Plot both X vs. Z graphs on
top of each other. Compare with the normal probability plot obtained from scipy
using stats.probplot().
• Generate 500 beta random variables with α = 10 and β = 2 and arrange them in
ascending order. Calculate Z by two different routes: first from Eq. 14.5 [hint:
use stats.norm.ppf()] and then from Eq. 14.6. Plot both X vs. Z graphs on
top of each other. Compare with the normal probability plot obtained from scipy
using stats.probplot().

7. (py) Heavier-tailed (than normal) distribution: If the probability density function de-
cays more slowly than that of the normal distribution, we call it a heavier-tailed (than
normal) distribution. The name originates from the fact that there are more data
points near the tails than for a normal distribution.

• A Beta distribution with α = β = 1 is an extreme example of a heavy-tailed distri-
bution. Draw the normal probability plot.
• Increase the values of α and β gradually and notice what happens.
• With the help of the box-and-whisker plot, verify that the heavy-tailed distribution
has fewer outliers.

8. (py) Normal probability plot of skewed distributions: Using a Beta distribution (as
shown above), create left-skewed and right-skewed data sets containing n = 50 (sample)
and n = 5000 (population) data points. Compare the normal probability plots of
the left-skewed and right-skewed data sets.

9. (py) Normal probability plot of skewed distributions: Using an exponential distribution
(as shown above), generate a data set containing n = 50 (sample) and n = 5000 (popu-
lation) data points. Draw the normal probability plot of the generated data set.

10. (py) Slope and intercept of the straight line in normal probability plot: Generate two
different random variables with µ = 15, σ = 4 and µ = 10, σ = 2. Plot both of them on
the same normal probability plot and compare.


Figure 14.6: An example of random sampling (X1, X2, ....., X200, each sample containing 10
data points) and the sampling distribution of the sample averages x̄1, x̄2, ....., x̄200, which
follows a normal distribution, according to the central limit theorem.

11. (py) Scatter plot and correlation heat map: Generate two random variables x1 and
x2 , as shown in Figure 14.3. Define x3 = x1 ∗ x1 + np.random.normal(0, 0.2, 2500) and
x4 = x2 ∗ x2 ∗ x2 + np.random.normal(0, 0.2, 2500). Generate the scatter plot matrix and
correlation heat map.

14.3 Point estimation of parameters


At the beginning of the chapter, we discussed the population and the sample. In reality,
a population is so large that there is no point trying to gather data about the entire popula-
tion. Instead, we have some limited amount of data available to us, known as the sample.
Our task is to estimate the population parameters (like µ and σ for a normal distribution,
or λ for an exponential distribution) based on the sample data.

14.3.1 Random sampling, Sampling distribution, Central limit theorem


Imagine that we somehow know that a population follows a certain probability distribution.3
For example, if the population follows a normal distribution, then we would like to know the
two important parameters: the mean (µ) and the variance (σ²). A point estimate of any
parameter of a probability distribution4 is a number calculated from a given sample of
very small size compared to the population. Thus, a point estimate is an approximation of
the unknown exact value of a parameter of the underlying probability distribution of the
population. Can we simply calculate the sample average and take it to be equal to the
parameter µ (population mean) of a normal distribution? Let us do an in silico experiment.
Say I select a sample X1 containing n data points from a population, and the sample
average turns out to be x̄1. Someone else selects another sample X2 containing n data
3 We may know it from the underlying physics, or we may guess it from the sample.
4 For example, µ and σ² of a normal distribution, or λ of an exponential distribution.


points from the same population, and the sample average turns out to be x̄2. This is known
as random sampling. Because of random sampling, the sample average varies, and we get
a sampling distribution. Let us write a Python code to do this exercise. First, we generate
p (p = nsample in the code shown in Figure 14.6) different samples X1, X2, ......, Xp,5 each
sample containing n = 10 data points. Each sample is drawn from the same
population (µ = 10, σ = 2). We also calculate the sample average or sample mean of each
sample.

X1 = {x1, x2, ....., xn}1 ⇒ x̄1,
X2 = {x1, x2, ....., xn}2 ⇒ x̄2,
..............................................
Xp = {x1, x2, ....., xn}p ⇒ x̄p.
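The code of Figure 14.6 is not reproduced here; a minimal sketch of the random-sampling experiment (variable names are my own) is:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, nsample = 10, 2, 10, 200

# Draw nsample independent samples of size n from the population N(mu, sigma)
samples = rng.normal(mu, sigma, size=(nsample, n))

xbar = samples.mean(axis=1)       # sampling distribution of the sample average
s2 = samples.var(axis=1, ddof=1)  # sampling distribution of the sample variance

print(xbar.mean())  # close to mu
print(s2.mean())    # close to sigma^2
print(xbar.var())   # close to sigma^2 / n
```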

Do the following exercises thoroughly to understand the concept.

• The sample averages x̄1, x̄2, ....., x̄p are different for every sample, although all of them
are drawn from the same population (µ = 10, σ = 2). Thus, we can define a new
random variable X̄ = {x̄1, x̄2, ···, x̄p}, which is the sampling distribution of the average
values. Modify the code [Figure 14.6] and plot the histogram of X̄. Do you see that
some of the sample averages differ significantly from the population mean?

• Similarly, we can define another random variable S² = {s²1, s²2, ···, s²p}, which is the
sampling distribution of the variance. Modify the code [Figure 14.6] and plot the
histogram of S².

• Modify the code [Figure 14.6], calculate the expectation values of X̄ and S², and
verify the following,

E(X̄) = µ,  E(S²) = σ².   (14.10)

The last equation is something you would expect intuitively. You can find the mathematical
proof in any standard statistics textbook.
Thus, it is clear that X̄ = {x̄1, x̄2, ....., x̄p} is a random variable. We can further verify that
X̄ is a normal random variable with mean µ and variance σ²/n. If the claim is true, we can
define a standard normal variable,

Z = (X̄ − µ)/(σ/√n).   (14.11)

Using the code [Figure 14.6] we can verify that Z has a mean of 0 and a variance of 1. The
above equation is known as the central limit theorem.6 In words, given a random sample X
of size n taken from a population with mean µ and variance σ², the sample average X̄ is a
normal random variable, with mean and variance equal to,

µX̄ = µ,  σ²X̄ = σ²/n.   (14.12)
Remember that we started with a population following a normal distribution. Will
the central limit theorem hold if the population follows some other distribution? It
turns out that, if the sample size is large enough (n > 30), the central limit theorem holds
irrespective of the probability distribution followed by the population. This implies that X
can have any probability distribution; X̄ is going to follow a normal distribution if n is large
enough [see exercise]. If X itself follows a normal distribution, then the central limit
theorem holds even for a small sample size (n ≈ 5).
5 These are independent random variables and each one has the same probability distribution.
6 We shall not attempt to prove the central limit theorem, but will just be happy with the “numerical proof”.


Example: Machine parts are manufactured with a mean length of 100 mm and a stan-
dard deviation of 10 mm. What is the probability that a random sample of size 25 has a
sample average greater than 105 mm?
Since σ²X̄ = σ²/n = 4, the standardized value of X̄ = 105 is Z = 2.5, and thus P(X̄ > 105) =
1 − P(Z < 2.5) = 1 − 0.9938 = 0.0062. In conclusion, such a large deviation of the sample
average from the population mean is a very rare event. One can further verify that
P(100 < X̄ < 101) ≈ 0.19, P(102 < X̄ < 103) ≈ 0.09 and P(104 < X̄ < 105) ≈ 0.02, which
clearly shows that the probability of getting a sample average differing significantly from
the population mean is very small.
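The probabilities above can be checked with scipy:

```python
from scipy import stats

mu, sigma, n = 100, 10, 25
se = sigma / n**0.5   # standard deviation of the sample average: 2 mm

# P(Xbar > 105)
p_tail = 1 - stats.norm.cdf(105, loc=mu, scale=se)
print(round(p_tail, 4))   # 0.0062

# P(100 < Xbar < 101), P(102 < Xbar < 103), P(104 < Xbar < 105)
probs = [stats.norm.cdf(hi, mu, se) - stats.norm.cdf(lo, mu, se)
         for lo, hi in [(100, 101), (102, 103), (104, 105)]]
for (lo, hi), p in zip([(100, 101), (102, 103), (104, 105)], probs):
    print(f"P({lo} < Xbar < {hi}) = {p:.4f}")
```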

14.3.2 Point estimation of mean and variance


Based on the above example, it is clear that the sample average is a reasonable point estimate
of the population mean,7 i.e., µ̂ = x̄, where x̄ is calculated from a sample using Eq. 14.1.
Similarly, the point estimate of the variance is σ̂² = s², where s² is calculated from a sample
using Eq. 14.2. The point estimate of the k-th moment is m̂k = x̄k, where x̄k is calculated
from a sample using Eq. 14.3. Note the use of µ̂ and σ̂², which signifies that these are not
the population µ or σ², but point estimates of the mean and variance calculated from a sample.
This can be extended to other quantities, such as p = proportion of females in a population.
The sample proportion is p̂ = x/n, where x is the number of females in a random sample of
size n.

14.3.3 Standard error


We already know from the central limit theorem that sampling from a normal distribution
with mean µ and variance σ² ensures the sample mean is a normal random variable X̄ with
mean µ and variance σ²/n. The standard error of X̄ is given by σX̄ = σ/√n. In case we do
not know σ, we have to calculate a point estimate of the standard error using σ̂X̄ = s/√n.

Example: Using the data given in Table 14.1, let us calculate the sample mean and the stan-
dard error of the sample mean.
The point estimate of the mean (sample mean) is µ̂ = x̄ = 155.125. The standard error of
the sample mean is σX̄ = σ/√n, where n = 40. Since we do not know σ, we use the sample
standard deviation s = 35.91, such that the point estimate of the standard error is
σ̂X̄ = s/√n = 5.68. Note that the standard error is reasonably small (∼3.7% of the sample
mean). It is safe to assume that the true mean lies within the range 155.125 ± 2 × 5.68.

14.3.4 Bootstrap standard error
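A minimal sketch of the idea, assuming the usual bootstrap recipe (resample the data with replacement many times and take the standard deviation of the resampled means); the choice B = 1000 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Strength values from Table 14.1
data = np.array([105, 97, 135, 163, 207, 134, 218, 199, 170, 176,
                 183, 153, 174, 154, 190, 70, 101, 112, 149, 200,
                 121, 120, 181, 160, 194, 144, 165, 145, 160, 150,
                 180, 167, 176, 158, 156, 229, 158, 148, 115, 88])

B = 1000
# Resample the data with replacement B times and record each resample's mean
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(B)])

se_boot = boot_means.std(ddof=1)  # bootstrap standard error of the mean
print(se_boot)                    # close to s/sqrt(n) = 5.68
```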


14.3.5 Exercise
1. (py) Repeat the exercise shown in Figure 14.6 and verify whether central limit theorem
works if the population follows a lognormal distribution.

2. (py) Repeat the exercise shown in Figure 14.6 and verify whether central limit theorem
works if the population follows an exponential distribution.

3. A random variable X has a continuous uniform probability distribution, f(x) = 1/2
for 4 ≤ x ≤ 6 and f(x) = 0 otherwise. If we take a random sample of size n = 30,
what would be the probability distribution of the sample mean? Draw the probability
distributions of X and X̄.

4. Machine parts are manufactured with a mean length of 100 mm and a standard
deviation of 10 mm. Compare the probability that a random sample has a sample
average greater than 105 mm for different sample sizes n = 5, 10, 15, 20, 25, 30.
7 Provided the sample size is reasonably large.


5. What is the difference between standard deviation and standard error?

14.4 Interval estimation of parameters: confidence interval


For a given random sample X = {x1, x2, ......, xn}, it is easy to calculate the sample mean x̄
[Eq. 14.1] and sample variance s² [Eq. 14.2]. Using the simplest method of point estimation,
we can take µ̂ = x̄ and σ̂ = s and assume that the population mean µ ≈ µ̂ and the population
standard deviation σ ≈ σ̂. However, this is not reliable, because we have no idea how far µ̂
and σ̂ are from the population values µ and σ. Thus, although point estimation of
parameters is very simple, it is not very reliable for the purpose of decision making.
From a given random sample X = {x1, x2, ......, xn}, we would like to estimate some un-
known parameter θ (e.g. θ = µ or θ = σ) of some distribution. However, we are going to
use interval estimation, which is a better choice than point estimation. We have to find an
interval θl ≤ θ ≤ θu which contains θ, not with certainty, but with a very high probability α,
i.e.,

P(θl ≤ θ ≤ θu) = α.

A popular choice of α is 0.95, or even higher, like 0.99. Very soon we shall discover that θl
and θu are not constants: if we take a different sample, θl and θu will change. In other
words, Θl and Θu are random variables, satisfying,

P(Θl ≤ θ ≤ Θu) = α.   (14.13)

For p different random samples X1, X2, ....., Xp, we shall get p pairs of values
(θl, θu)1, (θl, θu)2, ....., (θl, θu)p. Let us try to understand what this means with a specific
example.

14.4.1 Confidence interval for µ of normal distribution, σ 2 known


Let us understand the concept for a relatively simple case and then attempt to apply it to a
more general case. Assume that we know the population variance σ² of a normal random
variable and want to construct a confidence interval for the unknown parameter µ, i.e., we
want to construct a confidence interval for the population mean (µl ≤ µ ≤ µu) from a given
sample.
This is exactly the reason why we are trying to learn all these statistical methods. Be-
cause of the finite number of data points, the sample average differs from one sample to
another, although all the samples have the same population mean. For a given sample, we
have no idea how close the sample average is to the population mean, and it is not possible
to answer the question exactly. All we can do is predict some interval which holds the
population mean µ, not with certainty, but with a very high probability.
As discussed in the previous section, the sample average X̄ is a random variable and we
can define a standard normal variable,

Z = (X̄ − µ)/(σ/√n).   (14.14)

Once we have defined the standard normal variable Z, we can define an interval (−cα, cα)
such that [Figure 14.8],

P(−cα ≤ Z ≤ cα) = P(−cα ≤ (X̄ − µ)/(σ/√n) ≤ cα) = α,  where 0 ≤ α ≤ 1.

For a given α, we know the value of cα [Figure 14.8]. From the above equation, we can
write,

−cα ≤ (X̄ − µ)/(σ/√n) ≤ cα,


Figure 14.7: Constructing the confidence interval for µ repeatedly. The dots at the center
of the intervals represent the sample average x. Some of the intervals do not contain
population mean µ. The interval width shortens if we have more data points in a sample.

Figure 14.8: Choice of cα for different confidence intervals; each tail beyond ±cα carries
probability (1 − α)/2. For example, for α = 0.9, 0.95 and 0.99, cα = 1.65, 1.96 and 2.58,
respectively.

which further simplifies to,

X̄ − cασ/√n ≤ µ ≤ X̄ + cασ/√n.   (14.15)

Thus, the unknown population parameter µ lies in the interval (X̄ − cασ/√n, X̄ + cασ/√n),
not with certainty, but with probability α.

Example: Given a sample

X1 = {11.39, 7.46, 13.78, 11.98, 8.06, 11.69, 9.86, 12.02, 9.12, 11.73},

x̄1 = 10.71. Thus, if we want a 95% confidence interval, cα = 1.96, and with σ = 2, n = 10,
the confidence interval is 9.47 ≤ µ ≤ 11.95. If we take another sample X2, we get some
other interval, because X̄ changes. Several such intervals, constructed for different
samples, are shown in Figure 14.7.
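A sketch of this computation, with `stats.norm.ppf` supplying cα:

```python
import numpy as np
from scipy import stats

x = np.array([11.39, 7.46, 13.78, 11.98, 8.06, 11.69, 9.86, 12.02, 9.12, 11.73])
sigma, alpha = 2, 0.95          # known population sigma, confidence level
n, xbar = len(x), x.mean()

c = stats.norm.ppf(1 - (1 - alpha) / 2)  # c_alpha = 1.96 for alpha = 0.95
half = c * sigma / np.sqrt(n)            # half-width of the interval (Eq. 14.15)

print(f"{xbar - half:.2f} <= mu <= {xbar + half:.2f}")  # 9.47 <= mu <= 11.95
```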

Interpretation of confidence interval: A 95% confidence interval 9.47 ≤ µ ≤ 11.95 does
not imply that the population mean µ is within the interval with a probability of 0.95. As
shown in Figure 14.7, µ either lies within a given interval (probability 1) or it does not
(probability 0). Rather, the confidence interval is a random interval, and if we construct

Figure 14.9: Error E = |x̄ − µ| in estimating µ. Note that E ≤ cασ/√n.

many such intervals, 95% of them will contain the population mean µ.

Precision of the confidence interval and choice of sample size: The width of a confidence
interval is given by,

2cασ/√n = 3.92σ/√n for α = 0.95, and 5.16σ/√n for α = 0.99.

Thus, to get a higher level of confidence, the interval must be wider. The length of a
confidence interval is a measure of the precision of estimation; thus, precision is inversely
proportional to the confidence level. For the purpose of decision making, it is desirable
to have an adequate confidence level while at the same time keeping the interval short.
One way of achieving this is to increase the number of data points n in a sample
[Figure 14.7].
We can even estimate the desirable sample size for a specified error. If we use
x̄ to estimate the population mean µ, the error E = |x̄ − µ| is less than or equal to cασ/√n
[Figure 14.9]. Thus, for a specified E, we can find the appropriate sample size,

n = (cασ/E)².   (14.16)

In words, if we want to use x̄ as a point estimate of µ, we can be 100α% confident that the
error |x̄ − µ| will not exceed a specified amount E when the sample size is given by the above
equation. Note that, if the right-hand side of the above equation is not an integer, it must
be rounded up.

Example: In the previous example, 9.47 ≤ µ ≤ 11.95, so E = (11.95 − 9.47)/2 = 1.24, with
cα = 1.96 (95% confidence interval), σ = 2 and n = 10. If we want E = 0.555,

n = (1.96 × 2 / 0.555)² ≈ 50.

See Figure 14.7 to understand the difference between n = 10 and n = 50. Do you see
that while a sample of bigger size improves the precision, it does not enhance the
confidence level?
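Eq. 14.16 can be wrapped in a small helper (the function name is my own):

```python
import math

def sample_size(c_alpha, sigma, E):
    """Smallest n such that |xbar - mu| <= E with 100*alpha% confidence (Eq. 14.16),
    rounded up to the next integer."""
    return math.ceil((c_alpha * sigma / E) ** 2)

print(sample_size(1.96, 2, 0.555))  # 50
print(sample_size(1.96, 2, 1.24))   # 10
```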

14.4.2 Confidence interval for µ of normal distribution, σ 2 unknown


So far we have worked with known population variance σ 2 . However, in reality popula-
tion variation is also unknown to us. Can we use sample standard deviation and rewrite
Eq. 14.15 as,
cα S cα S
X− √ ≤µ≤X+ √ .
n n
Note that, both X and S are random variables, because their values depend on the given
sample. As shown in Figure 14.10, this works if the sample size is large enough. Otherwise,
for a small sample, replacing σ with S is not acceptable.


Figure 14.10: Replacing the population standard deviation σ by the sample standard
deviation S in Eq. 14.15, shown for n = 10 and n = 50: the replacement works if the sample
size is large enough.

Student’s t−distribution: 8 In case of a small sample size, the random variable,

X −µ
T = √ (14.17)
S/ n

has a t−distribution with n − 1 degrees of freedom. The t−probability distribution function
is,

f(x) = Γ[(k + 1)/2] / (√(πk) Γ(k/2)) · (x²/k + 1)^(−(k+1)/2),    −∞ < x < ∞,    (14.18)
where the parameter k is the number of degrees of freedom. The mean and variance of the
t−distribution are 0 and k/(k − 2) (for k > 2), respectively. Some examples are illustrated
in Figure 14.11. Although t−distribution looks very similar to a standard normal distri-
bution, the former has heavier tails than the latter. That means, t−distribution has more
probability at the tails than the standard normal distribution.
For a given value of α, tα (known as percentage points of the t−distribution) values are
tabulated for different values of k [Figure 14.11], such that

P(X ≤ tα) = ∫₋∞^tα f(x) dx.

For example, tα = 2.228 for k = 10 and α = 0.975 implies that the area under the curve up to tα
is P(X ≤ tα) = 0.975. The shaded region (beyond tα) has an area 1 − α = 0.025. Note that the
t−distribution is identical to the standard normal distribution at k = ∞.
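The tabulated percentage points can be verified directly from Eq. 14.18. A sketch using only the standard library (`math.lgamma` avoids overflow of Γ at large k; the cumulative probability is obtained by Simpson's rule plus the symmetry of the pdf): it confirms the heavier tails at small k and reproduces tα = 2.228 for k = 10, α = 0.975.

```python
import math

def t_pdf(x, k):
    """Probability distribution function of the t-distribution, Eq. 14.18."""
    log_norm = math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
    return math.exp(log_norm) / math.sqrt(math.pi * k) * (x * x / k + 1) ** (-(k + 1) / 2)

def normal_pdf(x):
    """Standard normal pdf, for comparison."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def t_cdf(t, k, steps=1000):
    """P(X <= t) for t >= 0: 0.5 (left half, by symmetry) plus
    Simpson's rule over [0, t]."""
    h = t / steps
    total = t_pdf(0, k) + t_pdf(t, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, k)
    return 0.5 + total * h / 3

print(t_pdf(3.0, 4) > normal_pdf(3.0))   # True: heavier tail at k = 4
print(round(t_cdf(2.228, 10), 3))        # 0.975, matching the table
```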
8
After the English statistician William Sealy Gosset, who published under the pseudonym "Student".

Percentage points tα of the t−distribution:

k      0.025    0.010    0.005
4      2.776    3.747    4.604
6      2.447    3.143    3.707
8      2.306    2.896    3.355
10     2.228    2.764    3.169
20     2.086    2.528    2.845
30     2.042    2.457    2.750
120    1.980    2.358    2.617
∞      1.960    2.326    2.576
       95% CI   98% CI   99% CI

Figure 14.11: (left) Probability distribution function of the t−distribution for different param-
eters. It is identical to a standard normal distribution at k = ∞. (center) Percentage points
of the t−distribution, denoted by tα and tabulated for different k and α values. (right) Choice
of tα for different confidence intervals. The last row (k = ∞) is identical to the standard
normal distribution.

Figure 14.12: A skewed beta distribution (parameters α = 1 and β = 5): histogram and
probability plot of a sample of size n = 50.

For constructing the confidence interval, we need to decide the value of tα. For this pur-
pose, we use the symmetry of the t−distribution. For example, to construct a 95% confidence
interval, we select the values given in the 0.025 column. Once we have selected a suitable value
of tα, the confidence interval is given by,

X − tα S/√n ≤ µ ≤ X + tα S/√n,    (14.19)

where X, S are calculated from a given sample containing n data points. Note that, unlike
the case of known population variance, we cannot define the interval width in advance,
because S varies from one sample to another, which adds more randomness to the whole
process. From Figure 14.11, it is clear that the constant tα is greater than the corresponding
cα for a given confidence interval. This provides some extra cushion, hoping to take care of
the added randomness. If the sample size is large (n ≈ 50), we can even use cα in place of tα.
However, for a smaller sample size, it is safer to use the t−distribution when the population
variance is unknown.
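The recipe of Eq. 14.19 can be sketched in code. The eleven numbers below are made up for illustration, so k = n − 1 = 10 and, for a 95% CI, tα = 2.228 from the table in Figure 14.11:

```python
import statistics

# Hypothetical data: n = 11 measurements, so k = n - 1 = 10 degrees of freedom
sample = [10.2, 9.8, 10.5, 10.1, 9.6, 10.3, 9.9, 10.4, 10.0, 9.7, 10.6]

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation S
t_alpha = 2.228                # 95% CI, k = 10 (Figure 14.11)

half = t_alpha * s / n ** 0.5
print(f"{x_bar - half:.3f} <= mu <= {x_bar + half:.3f}")  # 9.877 <= mu <= 10.323
```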

14.4.3 Central limit theorem: confidence interval for non-normal distribution
So far we have assumed normal distribution and calculated the confidence intervals. Can
we apply the same trick for any general distribution? Let us use the same method we used


Figure 14.13: Sample averages x̄1, x̄2, ..., x̄200 of independent non-normal random variables
X1, X2, ..., X200 follow a normal distribution provided the sample size is large enough (panels
compare sample size n = 10 and n = 50, with 200 samples each). Each sample is randomly
generated according to the skewed beta distribution shown in Figure 14.12.

before,

Z = (X − µ)/(S/√n),    (14.20)
where X, S are the sample average and sample standard deviation, µ is the population mean
and n is the sample size. If Z is found to be a standard normal variable, then all the tricks
we have learned so far can be applied.
Let us check using a beta distribution with parameters α = 1 and β = 5, which has a pop-
ulation mean equal to α/(α + β). From the histogram and normal probability plot [Fig-
ure 14.12], it is obvious that we have a right-skewed distribution. Now we can define a
random variable Z using Eq. 14.20, as illustrated in Figure 14.13. I have compared two
cases: one in which n = 10 points are used to calculate the sample average and sample
standard deviation, and another in which n = 50 points are used. From the histograms and
normal probability plots, it is obvious that for n = 10, Z is not a standard normal variable,
while for n = 50, it is. In conclusion, we can define a standard normal variable using
Eq. 14.20 for any probability distribution, provided the sample size is large enough. Re-
member that, if it is a normal distribution, the same method works even for a sample size
as small as n = 10. For a non-normal distribution, all we need is a larger sample, n ≈ 50.
It is remarkable that the sample average of any general independent random variables fol-
lows a normal distribution when the sample size is large. This is related to one of the most
fundamental results in probability theory, the central limit theorem. Once it is known
that X follows a normal distribution, we follow the same procedure to construct the confidence
interval,

X − cα S/√n ≤ µ ≤ X + cα S/√n,
where X and S are calculated from a sample of size n. A summary of confidence interval
construction is given in Table 14.5.
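The experiment behind Figure 14.13 takes only a few lines. A sketch, assuming the same skewed beta population (α = 1, β = 5, so µ = α/(α + β) = 1/6): draw 200 samples of size 50 and look at the 200 sample averages.

```python
import random
import statistics

random.seed(42)
a, b = 1, 5                 # beta parameters; population mean = a / (a + b)
mu = a / (a + b)

def sample_average(n):
    """Average of n random draws from the skewed beta distribution."""
    return statistics.mean(random.betavariate(a, b) for _ in range(n))

averages = [sample_average(50) for _ in range(200)]  # 200 samples of size 50
print(statistics.mean(averages))   # close to mu = 1/6 = 0.1667
```

Plotting a histogram of `averages` reproduces the approximately normal, bell-shaped spread around µ seen in the figure.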


               Normal population        Normal population        Non-normal population
               Known σ                  Unknown σ                Unknown σ
Sample         X = {x1, x2, ..., xn}    X = {x1, x2, ..., xn}    X = {x1, x2, ..., xn}
Calculate      X                        X, S                     X, S
Statistic      Z = (X − µ)/(σ/√n)       T = (X − µ)/(S/√n)       Z = (X − µ)/(S/√n)
Distribution   Std. normal              t−distribution           Std. normal
Decide         cα for α% CI             tα for α% CI             cα for α% CI
Interval       X − cα σ/√n ≤ µ          X − tα S/√n ≤ µ          X − cα S/√n ≤ µ
                 ≤ X + cα σ/√n            ≤ X + tα S/√n            ≤ X + cα S/√n

Table 14.5: Construction of confidence interval (CI) for µ, where the sample comes from
a normal or non-normal population. If the population is normal, the method works for a
small number of data points (n ≈ 10) in a sample. If the population is non-normal, we need
a sample with more data points (n ≈ 50).

14.4.4 Exercise
1. The length of a rod is a normal random variable distributed with σ = 1 cm. Ten
measurements are as follows: 50.7, 50.5, 50.6, 50.5, 50.3, 50.6, 50.8, 50.2, 50.3,
50.1. Find a 95% CI for µ, the mean length.
2. Repeat the same problem, this time for 99% CI.
3. From the above two problems, verify that the precision level for 95% CI is higher than
that of 99% CI. What would you do to improve the precision level of 99% CI?
4. Plot n vs. E (absolute error) for 95% CI and 99% CI on paper.
5. Plot n vs. E (absolute error) for 95% CI with σ = 2 and σ = 4 on paper.
6. (py) Plot n vs. E (absolute error) for 95% CI and 99% CI using python. Assume σ = 2.
7. (py) Plot n vs. E (absolute error) for 95% CI with σ = 2 and σ = 4 using python.
8. Based on the plots, what would you conclude?

14.5 Hypothesis testing: decision-making


We have learned to estimate a population parameter (like the mean) from sample data, using
either a point estimate or an interval estimate. There can be a situation where there are two
competing claims about the numerical value of a parameter. For example, a seller claims
that the strength of an alloy is 100 MPa. How do we verify the seller's claim that the strength
is really 100 MPa and not more or less than 100 MPa? Remember that all we can do is take
a random sample (say 10 specimens), measure the strength of each specimen and calculate
the sample average. If the sample average is less than the claimed population mean of
100 MPa, how do we know whether it is because of the random nature of the sample average9
or because of the inferior quality of the alloy (the population mean itself is less than 100 MPa)?
9
Select another set of 10 specimens and you will get another value for the sample average.
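Footnote 9 is easy to demonstrate in code: two random samples of 10 specimens drawn from the same population give two different sample averages. A sketch under an assumed model (strengths normally distributed with µ = 100 MPa and σ = 5 MPa; both numbers are hypothetical):

```python
import random
import statistics

random.seed(7)

def sample_average(n=10, mu=100.0, sigma=5.0):
    """Average strength (MPa) of n randomly chosen specimens."""
    return statistics.mean(random.gauss(mu, sigma) for _ in range(n))

first, second = sample_average(), sample_average()
print(first, second)   # two different values, both scattered around 100
```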
