Statistics
Imagine that 500,000 bolts are produced every day in a factory. The bolts
must satisfy certain quality criteria. Ideally, one should check every bolt before it is used.
However, it would be too expensive to test each one of them. So we have to choose a few
bolts (say 50) at random. The smaller set of bolts (chosen for testing) is known as the
sample, which is selected from the larger set of 500,000, known as the population.
where n is the sample size. Similarly, we can find the variance of a sample as,
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (14.2)
Note that the sample mean or sample variance need not be equal to the population mean or popu-
lation variance. Similarly, the k-th moment of a sample is,
\overline{x^k} = \frac{1}{n} \sum_{i=1}^{n} x_i^k. \qquad (14.3)
Note that, the first moment is equal to the mean and the second moment is related to the
variance.
There are a few special points which are very useful for statistical analysis. For example,
the median is the point that divides the data into two equal parts: half below the median and
half above. The point below which 25% of the data points of a sample lie is known as the
first quartile or lower quartile (q1). The point below which 50% of the data points of a sample
lie is known as the second quartile (q2), which is the same as the median. The point below which
75% of the data points of a sample lie is known as the third quartile (q3). The minimum
and maximum values define the sample range. The interquartile range (IQR) is defined as
q3 − q1. A data set is given in Table 14.1 and the corresponding numerical summary is given in
Table 14.2.
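The quartiles and IQR defined above can be computed directly with NumPy. A minimal sketch (the data values below are illustrative stand-ins, not the actual entries of Table 14.1):

```python
import numpy as np

# Illustrative strength values (Table 14.1 itself is not reproduced here)
data = np.array([70, 121, 134, 136, 145, 150, 155, 158, 162, 170,
                 174, 178, 190, 205, 215, 225])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # lower, middle, upper quartiles
iqr = q3 - q1                                   # interquartile range
sample_range = data.max() - data.min()          # max - min

print(f"median = {q2}, IQR = {iqr}, range = {sample_range}")
```

With a pandas DataFrame, `df.describe()` reports the same five-number summary in one call.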
14.2. DATA ANALYSIS: GRAPHICAL REPRESENTATION OF DATA
Table 14.1: Population: 40000 pieces of a certain alloy are prepared every day in some fac-
tory. Their strength should meet a certain criterion. Practically, it is impossible to test each
of the 40000 pieces, so one has to select 40 pieces randomly and test them. Sample: 40
pieces of the alloy, randomly selected from the population of 40000 pieces. After testing
each piece, the strength values are listed in the table. A numerical summary of the data
is given in Table 14.2.
Table 14.2: Numerical summary of the data provided in Table 14.1. Python code to generate
the data summary is given in Figure 14.1. Provided the data set follows a symmetric
probability distribution (like the normal distribution), the mean and median are nearly equal. In
the case of a skewed distribution, the mean and median differ from one another (see Exercise).
Figure 14.1: Data summary, histogram and cumulative distribution plot for the data given
in Table 14.1. The mode is the value appearing most often in a set (here somewhere between 150 and
170). You can verify this from the stem-and-leaf diagram as well (Table 14.3). Provided the data
set follows a symmetric distribution (like the normal distribution), the mean, median and mode are
nearly equal. In the case of a skewed distribution, the mean, median and mode differ from one
another (see Exercise).
Figure 14.2: Box-and-whisker plot for the data given in Table 14.1, with quartiles q1 = 134.75, q2 = 158 and q3 = 177, and the point 70 marked as an outlier. Note that df.describe()
generates the summary reported in Table 14.2.
14.2.2 Histogram
A histogram of the data given in Table 14.1 is shown in Figure 14.1. A histogram is a
frequency distribution of the data. The range of the data is divided into a certain number
of intervals, known as bins. Generally, bins of equal width are used.¹ In this particular
problem, 8 bins of width 20 are taken, i.e., 70-90, 90-110, 110-130, 130-150,
150-170, 170-190, 190-210, 210-230. There are several ways to represent the histogram.
I have shown two different representations in Figure 14.1: the number of observations in each
bin, and a normalized plot where the total area under the histogram is equal to 1.
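The binning described above can be sketched with np.histogram. The data here are randomly generated stand-ins for Table 14.1, while the bin edges follow the text:

```python
import numpy as np

# Stand-in for the 40 strength values of Table 14.1
rng = np.random.default_rng(0)
data = rng.normal(155, 36, 40)

# 8 bins of width 20 covering 70-230, as in the text
bins = np.arange(70, 231, 20)
counts, edges = np.histogram(data, bins=bins)             # raw counts per bin
density, _ = np.histogram(data, bins=bins, density=True)  # normalized version

print(counts)
print("total area =", (density * np.diff(edges)).sum())   # area under histogram is 1
```

Passing `density=True` rescales the bar heights so that the total area is 1, which is the second representation shown in Figure 14.1.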
Figure 14.3: (Top row) Scatter plot matrix. Correlation coefficients are calculated using
Eq 14.4 and illustrated in a correlation heat map (bottom row). Since x1 and x2 are two
randomly generated variables, there is no correlation between them. On the other hand,
x3 and x4 are deliberately defined in such a way that they are linearly related to x1 and x2 ,
respectively.
While rxy = 0 implies no linear relationship, rxy = 1 and rxy = −1 indicate a perfect positive
and negative linear relationship, respectively. Generally, |rxy| < 0.5
indicates a weak linear dependence, while |rxy| > 0.8 indicates a strong linear dependence.
An example is given in Figure 14.3.
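A correlation matrix like the one illustrated in Figure 14.3 can be sketched with np.corrcoef. Here x3 and x4 are hypothetical variables tied linearly to x1 and x2 (positively and negatively), with small added noise:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 2500)
x2 = rng.normal(0, 1, 2500)
x3 = x1 + rng.normal(0, 0.2, 2500)   # linearly related to x1
x4 = -x2 + rng.normal(0, 0.2, 2500)  # negatively related to x2

r = np.corrcoef([x1, x2, x3, x4])    # 4x4 matrix of pairwise r values
print(np.round(r, 2))
```

The off-diagonal entry r[0, 1] stays near 0 (independent variables), while r[0, 2] and r[1, 3] approach +1 and −1. Feeding this matrix to a heat-map plotter reproduces the bottom row of Figure 14.3.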
 i    xi    (i − 0.5)/10 = Φ(zi)    zi
 1    176   0.05                  −1.64
 2    180   0.15                  −1.04
 3    187   0.25                  −0.67
 4    190   0.35                  −0.39
 5    193   0.45                  −0.13
 6    196   0.55                   0.13
 7    201   0.65                   0.39
 8    205   0.75                   0.67
 9    211   0.85                   1.04
10    220   0.95                   1.64
Table 14.4: Constructing a normal probability plot from a given set of data. Value of zi is
obtained from the cumulative standard normal distribution table (see Probability chapter,
Figure 13.16).
Figure 14.4: Top row: normal probability plot of the data provided in Table 14.4, (cen-
ter) plotted using the (xi, zi) data given in the table and (right) plotted using scipy. Bottom
row: (center) histogram and (right) normal probability plot of 500 randomly generated data
points, drawn from the normal distribution. In conclusion, if the data set follows a normal
probability distribution, the data points lie along a straight line in the normal probability
plot. Can you guess the mean and variance just by looking at the probability plot?
Figure 14.5: Normal probability plot of non-normal distributions: (left) heavier-tailed [than
normal] and (right) skewed distribution.
matter in this case. In fact, this is the straight line you see in the normal probability plot.
Now, X vs. Z, plotted using the Z values obtained from Eq. 14.5, falls on this straight line
[Eq. 14.6], provided X follows a normal distribution. Otherwise, if the data are not described
by a normal distribution, we should observe some deviation from the straight line [see
Figure 14.5 and Exercise].
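The construction of Table 14.4 can be reproduced with scipy; stats.norm.ppf inverts the cumulative standard normal distribution Φ:

```python
import numpy as np
from scipy import stats

# Data from Table 14.4, already sorted in ascending order
x = np.array([176, 180, 187, 190, 193, 196, 201, 205, 211, 220])
n = len(x)

# Plotting positions (i - 0.5)/n and the corresponding standard normal quantiles
p = (np.arange(1, n + 1) - 0.5) / n
z = stats.norm.ppf(p)          # z_i such that Phi(z_i) = (i - 0.5)/n

print(np.round(z, 2))          # matches the z_i column of Table 14.4
```

Plotting x against z (e.g., with matplotlib) then gives the normal probability plot of Figure 14.4; `stats.probplot(x, plot=ax)` produces the same plot in one call.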
14.2.6 Exercise
1. (py) Symmetric data: Normal distribution is an example of a symmetric distribution.
In this case, mean, median and mode are equal. You can use np.random.normal(µ, σ, n)
to create a random normal data set of size n. Take n = 50 (sample) and n = 5000
(population) and do the following.
• Use µ = 0, σ = 1. Generate the summary of the data set, plot the histogram and
box-and-whisker plot. Verify whether mean, median and mode are nearly equal
or not.
• Use µ = 0, σ = 2. Generate the summary of the data set, plot the histogram and
box-and-whisker plot. Verify whether mean, median and mode are nearly equal
or not.
f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}, \quad \text{for } 0 < x < 1. \qquad (14.7)

f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \quad \text{for } x > 0,\ \alpha, \beta > 0. \qquad (14.8)
You can use np.random.gamma(α, β, n) to create a random gamma data set of size n (note
that NumPy parameterizes the gamma distribution by shape and scale, so the second argument
is the scale θ = 1/β for the density in Eq. 14.8). Take n = 50 (sample) and n = 5000 (population)
and do the following. Is the distribution left-skewed or right-skewed? Generate the summary
of the data set, and plot the histogram and box-and-whisker plot for different α, β values.
5. (py) Compare the cumulative histogram plots of a symmetric, left-skewed and right-
skewed data set.
6. (py) Understanding normal probability plot: In this problem we shall learn to draw the
normal probability plot from scratch completely by ourselves and also compare with
the plot generated from scipy.
• Generate 500 normal random variables with µ = 10 and σ = 2 and arrange them
in ascending order. Calculate Z in two different routes: first from Eq. 14.5 [hint:
use stats.norm.ppf()] and next from Eq. 14.6. Plot both the X vs. Z graphs on
top of each other. Compare with the normal probability plot obtained from scipy
using stats.probplot().
• Generate 500 beta random variables with α = 10 and β = 2 and arrange them in
ascending order. Calculate Z in two different routes: first from Eq. 14.5 [hint:
use stats.norm.ppf()] and next from Eq. 14.6. Plot both the X vs. Z graphs on
top of each other. Compare with the normal probability plot obtained from scipy
using stats.probplot().
7. (py) Heavier-tailed (than normal) distribution: If the probability density function de-
cays more slowly than the normal distribution, we call it a heavier-tailed (than
normal) distribution. The name originates from the fact that there are more data
points near the tails than for a normal distribution.
10. (py) Slope and intercept of the straight line in normal probability plot: Generate two
different random variables with µ = 15, σ = 4 and µ = 10, σ = 2. Plot both of them on
the same normal probability plot and compare.
14.3. POINT ESTIMATION OF PARAMETERS
Figure 14.6: An example of random sampling (X1, X2, …, X200, each sample containing 10
data points) and the sampling distribution of sample averages x̄1, x̄2, …, x̄200, which follows a
normal distribution, according to the central limit theorem.
11. (py) Scatter plot and correlation heat map: Generate two random variables x1 and
x2 , as shown in Figure 14.3. Define x3 = x1 ∗ x1 + np.random.normal(0, 0.2, 2500) and
x4 = x2 ∗ x2 ∗ x2 + np.random.normal(0, 0.2, 2500). Generate the scatter plot matrix and
correlation heat map.
points from the same population and the sample average turns out to be x̄2. This is known
as random sampling. Because of random sampling, the sample average will vary, and we get
a sampling distribution. Let us write a Python code to do this exercise. First, we generate
p (p = nsample in the code shown in Figure 14.6) different samples X1, X2, …, Xp,⁵ each
sample containing n = 10 data points. Each sample is drawn from the same
population (µ = 10, σ = 2). We also calculate the sample average, or sample mean, for each
sample.
X1 = {x1, x2, …, xn}1 ⇒ x̄1,
X2 = {x1, x2, …, xn}2 ⇒ x̄2,
⋮
Xp = {x1, x2, …, xn}p ⇒ x̄p.
• Sample average values x̄1, x̄2, …, x̄p are different for every sample, although all of them
are drawn from the same population (µ = 10, σ = 2). Thus, we can define a new
random variable X̄ = {x̄1, x̄2, …, x̄p}, which is the sampling distribution of average
values. Modify the code [Figure 14.6] and plot the histogram of X̄. Do you see that
some of the sample averages differ significantly from the population mean?
• Similarly, we can define another random variable S² = {s₁², s₂², …, sₚ²}, which is the sam-
pling distribution of variance. Modify the code [Figure 14.6] and plot the histogram of
S².
• Modify the code [Figure 14.6], calculate the expectation values of X̄ and S², and
verify the following,

E(\bar{X}) = \mu, \qquad E(S^2) = \sigma^2. \qquad (14.10)
The last equation is something you would expect intuitively. You can find the mathematical
proof in any standard statistics text book.
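The random-sampling exercise above can be sketched as follows; the seed is arbitrary, and p = 200 samples of size n = 10 follow the text:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, p = 10, 2, 10, 200

# p independent samples, each of size n, from the same normal population
X = rng.normal(mu, sigma, size=(p, n))
xbar = X.mean(axis=1)          # sampling distribution of the average
s2 = X.var(axis=1, ddof=1)     # sampling distribution of the variance (Eq. 14.2)

# Averaging over many samples approximates the expectation values of Eq. 14.10
print("E(Xbar) ~", xbar.mean(), "  E(S^2) ~", s2.mean())
```

The two printed numbers hover around µ = 10 and σ² = 4, the "numerical proof" of Eq. 14.10.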
Thus, it is clear that X̄ = {x̄1, x̄2, …, x̄p} is a random variable. We can further verify that
X̄ is a normal random variable with mean µ and variance σ²/n. If the claim is true, we can
define a standard normal variable,

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}. \qquad (14.11)
Using the code [Figure 14.6], we can verify that Z has a mean of 0 and a variance of 1. The
above equation is known as the central limit theorem.⁶ In words, given a random sample X
of size n taken from a population with mean µ and variance σ², the sample average X̄ is a
normal random variable, with mean and variance equal to,

\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}. \qquad (14.12)
Remember that we started with a population following the normal distribution. Will
the central limit theorem hold good if the population follows some other distribution? It
turns out that, if the sample size is large enough (n > 30), the central limit theorem holds good
irrespective of the probability distribution followed by the population. This implies that X
can have any probability distribution; X̄ is going to follow a normal distribution if n is large
enough [see Exercise]. If X follows a normal distribution, then the central limit theorem holds
good even for a small sample size (n ≈ 5).
⁵These are independent random variables, and each one has the same probability distribution.
⁶We shall not attempt to prove the central limit theorem, but will just be happy with the “numerical proof”.
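This "numerical proof" can be extended to a non-normal population. A hedged sketch using an exponential population (the sample sizes 5 and 50, the seed, and p = 2000 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 2000

# Exponential population (heavily right-skewed, population mean = 1)
small = rng.exponential(1.0, size=(p, 5)).mean(axis=1)    # averages of n = 5
large = rng.exponential(1.0, size=(p, 50)).mean(axis=1)   # averages of n = 50

def skew(a):
    """Sample skewness: third standardized moment."""
    return np.mean(((a - a.mean()) / a.std()) ** 3)

# Skewness of the sampling distribution shrinks as n grows (CLT at work)
print(skew(small), skew(large))
```

For n = 5 the averages still inherit some of the population's skewness; for n = 50 their histogram is already close to a normal curve.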
Example: Machine parts are manufactured with a mean length of 100 mm and a stan-
dard deviation of 10 mm. What is the probability that a random sample of size 25 has a
sample average greater than 105 mm?
Since σ²_X̄ = σ²/n = 100/25 = 4, the standardized value of x̄ = 105 is z = (105 − 100)/2 = 2.5, and thus P(X̄ > 105) =
1 − P(Z < 2.5) = 1 − 0.9938 = 0.0062. In conclusion, such a large deviation of the sample average
from the population mean is a very rare event. One can further compute P(100 < X̄ < 101),
P(102 < X̄ < 103), P(104 < X̄ < 105), etc., which clearly shows that the probability of
getting a sample average differing significantly from the population mean is very small.
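The probabilities in this example can be checked with scipy.stats.norm (a small verification sketch):

```python
import math
from scipy import stats

mu, sigma, n = 100, 10, 25
se = sigma / math.sqrt(n)               # sigma_xbar = 2

# P(Xbar > 105): standardize and use the standard normal CDF
z = (105 - mu) / se
p_tail = 1 - stats.norm.cdf(z)

# Probabilities of intervals further and further from the mean
p1 = stats.norm.cdf(101, mu, se) - stats.norm.cdf(100, mu, se)
p2 = stats.norm.cdf(103, mu, se) - stats.norm.cdf(102, mu, se)
p3 = stats.norm.cdf(105, mu, se) - stats.norm.cdf(104, mu, se)

print(round(p_tail, 4), round(p1, 4), round(p2, 4), round(p3, 4))
```

The interval probabilities fall off rapidly with distance from the mean, confirming the conclusion above.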
Example: Using the data given in Table 14.1, let us calculate the sample mean and the stan-
dard error of the sample mean.
The point estimate of the mean (sample mean) is µ̂ = x̄ = 155.125. The standard error of the sample
mean is σ_X̄ = σ/√n, where n = 40. Since we do not know σ, we use the sample standard
deviation s = 35.91, such that the point estimate of the standard error is σ̂_X̄ = s/√n = 5.68.
Note that the standard error is reasonably small (∼ 3% of the sample mean). It is safe to
assume that the true mean lies within the range 155.125 ± 2 × 5.68.
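A quick check of the standard-error arithmetic, using the summary values quoted above rather than the raw Table 14.1 data:

```python
import math

n = 40
xbar = 155.125     # sample mean from Table 14.1
s = 35.91          # sample standard deviation from Table 14.1

se = s / math.sqrt(n)            # estimated standard error of the mean
lo, hi = xbar - 2 * se, xbar + 2 * se
print(round(se, 2), round(lo, 2), round(hi, 2))
```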
2. (py) Repeat the exercise shown in Figure 14.6 and verify whether central limit theorem
works if the population follows an exponential distribution.
4. Machine parts are manufactured with a mean length of 100 mm and a standard
deviation of 10 mm. Compare the probability that a random sample has a sample
average greater than 105 mm for different sample sizes n = 5, 10, 15, 20, 25, 30.
⁷Provided the sample size is reasonably large.
14.4. INTERVAL ESTIMATION OF PARAMETERS: CONFIDENCE INTERVAL
P (Θl ≤ θ ≤ Θu ) = α. (14.13)
For p different random samples X1, X2, …, Xp, we shall get p pairs of values (θl, θu)1, (θl, θu)2, …, (θl, θu)p.
Let us try to understand what this means with a specific example.
Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}. \qquad (14.14)
Once we have defined a standard random variable Z, we can define an interval (−cα, cα)
such that [Figure 14.8],

P(-c_\alpha \le Z \le c_\alpha) = P\left(-c_\alpha \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le c_\alpha\right) = \alpha, \quad \text{where } 0 \le \alpha \le 1.
For a given α, we know the value of cα [Figure 14.8]. From the above equation, we can
write,

-c_\alpha \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le c_\alpha,
Figure 14.7: Constructing the confidence interval for µ repeatedly. The dots at the center
of the intervals represent the sample average x̄. Some of the intervals do not contain the
population mean µ. The interval width shrinks if we have more data points in a sample.
Figure 14.8: Choice of cα for different confidence levels; the two tails beyond ±cα each have
area (1 − α)/2. For example, for α = 0.9, 0.95 and 0.99, cα = 1.65, 1.96 and 2.58, respectively.
X1 = {11.39, 7.46, 13.78, 11.98, 8.06, 11.69, 9.86, 12.02, 9.12, 11.73},
with x̄1 = 10.71. Thus, if we want a 95% confidence interval, cα = 1.96, σ = 2 and n = 10, such
that the confidence interval is 9.47 ≤ µ ≤ 11.95. If we take another sample X2, we get some
other interval because X̄ changes. Several such intervals can be constructed for different
samples, as shown in Figure 14.7.
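The interval for X1 above can be verified numerically (a minimal sketch):

```python
import numpy as np

X1 = np.array([11.39, 7.46, 13.78, 11.98, 8.06,
               11.69, 9.86, 12.02, 9.12, 11.73])
sigma, c = 2, 1.96                 # known sigma, 95% confidence level
n = len(X1)

xbar = X1.mean()
half = c * sigma / np.sqrt(n)      # half-width c_alpha * sigma / sqrt(n)
lo, hi = xbar - half, xbar + half
print(round(xbar, 2), round(lo, 2), round(hi, 2))
```

Repeating this for fresh samples drawn from the same population reproduces the collection of intervals in Figure 14.7.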
Figure 14.9: Error in estimating µ. The interval (x̄ − cα σ/√n, x̄ + cα σ/√n) is centered at x̄,
and the error is E = |x̄ − µ|. Note that E ≤ cα σ/√n.
If we construct many such intervals, 95% of them will contain the population mean µ.
Precision of the confidence interval and choice of sample size: The width of a confidence
interval is given by,

2\frac{c_\alpha \sigma}{\sqrt{n}} = \frac{3.92\sigma}{\sqrt{n}} \text{ for } \alpha = 0.95, \quad \text{and} \quad \frac{5.16\sigma}{\sqrt{n}} \text{ for } \alpha = 0.99.
Thus, to get a higher level of confidence, the interval must be wider. The length of a
confidence interval is a measure of the precision of estimation: precision is inversely
proportional to the confidence level. For the purpose of decision making, it is desirable
to get an adequate confidence level, but at the same time the interval should be short
enough. One way of achieving this is to increase the number of data points n in a sample
[Figure 14.7].
We can even estimate the desirable sample size for a specified error. If we are using
x̄ to estimate the population mean µ, the error E = |x̄ − µ| is less than or equal to cα σ/√n
[Figure 14.9]. Thus, for a specified E, we can find the appropriate sample size to be,

n = \left(\frac{c_\alpha \sigma}{E}\right)^2. \qquad (14.16)
In words, if we want to use x̄ as a point estimate of µ, we can be 100α% confident that the
error |x̄ − µ| will not exceed a specified amount E when the sample size is given by the above
equation. Note that, if the right-hand side of the above equation is not an integer, it needs
to be rounded up.
Example: In the previous example, 9.47 ≤ µ ≤ 11.95, E = (11.95 − 9.47)/2 = 1.24, cα = 1.96
(for a 95% confidence interval) and σ = 2, such that n = 10. If we want E = 0.555,

n = \left(\frac{1.96 \times 2}{0.555}\right)^2 \approx 50.
See Figure 14.7 to understand the difference between n = 10 and n = 50. Do you realize
that while a sample of bigger size helps to improve the precision, it does not enhance the
confidence level?
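Eq. 14.16 translates to a one-line helper; math.ceil implements the round-up rule:

```python
import math

def sample_size(c_alpha, sigma, E):
    """Smallest n such that the error |xbar - mu| <= E at the given confidence level."""
    return math.ceil((c_alpha * sigma / E) ** 2)

print(sample_size(1.96, 2, 0.555))   # the example above
```

The same helper shows the cost of more confidence: with cα = 2.58 (99%) and the same σ and E, the required n grows to 87.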
Figure 14.10: Replacing the sample standard deviation S in place of the population standard
deviation σ in Eq. 14.15 works if the sample size is large enough (compare n = 10 and n = 50).
T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \qquad (14.17)
For example, tα = 2.228 for k = 10 and α = 0.975 implies that the area under the curve up to tα
is P(T ≤ tα) = 0.975. The shaded region (beyond tα) has an area 1 − α = 0.025. Note that the
t-distribution is identical to the standard normal distribution at k = ∞.
For constructing the confidence interval, we need to decide the value of tα. For this pur-
⁸After the English statistician William Sealy Gosset, who published under the pseudonym “Student”.
Figure 14.11: (left) Probability density function of the t-distribution for different param-
eters. It is identical to the standard normal distribution at k = ∞. (center) Percentage points
of the t-distribution, denoted by tα and tabulated for different k and α values; the area under
the curve up to tα is α, and the tail beyond tα has area 1 − α. (right) Choice of tα for different
confidence intervals; the two tails beyond ±tα each have area (1 − α)/2. The last row (k = ∞)
is identical to the standard normal distribution.
Figure 14.12: A skewed beta distribution (parameters α = 1 and β = 5): histogram and
probability plot of a sample of size n = 50.
pose, we use the symmetry of the t-distribution. For example, to construct a 95% confidence
interval, we select the values given in the 0.025 column. Once we have selected a suitable value
of tα, the confidence interval is given by,
\bar{X} - \frac{t_\alpha S}{\sqrt{n}} \le \mu \le \bar{X} + \frac{t_\alpha S}{\sqrt{n}}, \qquad (14.19)
where X̄ and S are calculated from a given sample containing n data points. Note that, unlike
the case of known population variance, we cannot specify the interval width in advance, because
S varies from one sample to another, which adds more randomness to the whole process. From
Figure 14.11, it is clear that the constant tα is greater than the corresponding cα for a given
confidence interval. This provides some extra cushion to take care of the added randomness.
However, if the sample size is large (n ≈ 50), we can even use cα in place of tα. For smaller
sample sizes, it is safer to use the t-distribution when the population variance is unknown.
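A t-based confidence interval can be sketched with scipy; the sample below is randomly generated (µ = 10, σ = 2 and n = 10 are illustrative), and stats.t.ppf supplies tα:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(10, 2, 10)            # small sample; sigma treated as unknown
n = len(x)

xbar, s = x.mean(), x.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)     # t_alpha for a 95% interval, k = n - 1
half = t * s / np.sqrt(n)            # half-width t_alpha * S / sqrt(n), cf. Eq. 14.19
print(round(xbar - half, 2), round(xbar + half, 2))
```

Since t > cα = 1.96 for small k, the t-based interval is slightly wider than the z-based one, which is the "extra cushion" mentioned above.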
Figure 14.13: Sample averages x̄1, x̄2, …, x̄200 of independent non-normal random variables
X1, X2, …, X200 follow a normal distribution, provided the sample size is large enough. Each sam-
ple is randomly generated according to the skewed beta distribution shown in Figure 14.12.
before,

Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}, \qquad (14.20)
where X̄ and S are the sample average and sample standard deviation, µ is the population mean
and n is the sample size. If Z is found to be a standard normal variable, then all the tricks
we have learned so far can be applied.
Let us check using a beta distribution with parameters α = 1 and β = 5, which has a pop-
ulation mean equal to α/(α + β). From the histogram and normal probability plot [Fig-
ure 14.12], it is obvious that we have a right-skewed distribution. Now we can define a
random variable Z using Eq. 14.20, as illustrated in Figure 14.13. I have compared two
cases: first, in which n = 10 points are used to calculate the sample average and sample
standard deviation, and second, in which n = 50 points are used. From the histograms and
normal probability plots, it is obvious that for n = 10, Z is not a standard normal variable,
while for n = 50, it is. In conclusion, we can define a standard normal variable using
Eq. 14.20 for any probability distribution, provided the sample size is large enough. Re-
member that, if it is a normal distribution, the same method works even for a sample size
as small as n = 10. For a non-normal distribution, all we need is a larger sample, n ≈ 50.
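The comparison in Figure 14.13 can be sketched as follows; the seed and the number of samples p = 2000 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = 1, 5
mu = a / (a + b)                     # population mean of Beta(1, 5)
p = 2000

def z_values(n):
    """Standardized sample averages (Eq. 14.20) for p samples of size n."""
    X = rng.beta(a, b, size=(p, n))
    xbar = X.mean(axis=1)
    s = X.std(axis=1, ddof=1)
    return (xbar - mu) / (s / np.sqrt(n))

z10, z50 = z_values(10), z_values(50)
print(np.mean(z50), np.std(z50))     # close to 0 and 1 for n = 50
```

Plotting histograms of z10 and z50 (or feeding them to stats.probplot) reproduces the contrast between the two rows of Figure 14.13.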
It is remarkable that the sample average of any general independent random variables fol-
lows a normal distribution when the sample size is large. This is one of the most
fundamental results in probability theory, the central limit theorem. Once it is known
that X̄ follows a normal distribution, we follow the same procedure to construct the confidence
interval,

\bar{X} - \frac{c_\alpha S}{\sqrt{n}} \le \mu \le \bar{X} + \frac{c_\alpha S}{\sqrt{n}},

where X̄ and S are calculated from a sample of size n. A summary of confidence interval
construction is given in Table 14.5.
14.5. HYPOTHESIS TESTING: DECISION-MAKING
Table 14.5: Construction of the confidence interval (CI) for µ, where the sample comes from
a normal or non-normal population. If the population is normal, the method works for a
small number of data points (n ≈ 10) in a sample. If the population is non-normal, we need
a sample with more data points (n ≈ 50).
14.4.4 Exercise
1. The length of a rod is a normal random variable distributed with σ = 1 cm. Ten
measurements are as follows: 50.7, 50.5, 50.6, 50.5, 50.3, 50.6, 50.8, 50.2, 50.3,
50.1. Find a 95% CI for µ, the mean length.
2. Repeat the same problem, this time for 99% CI.
3. From the above two problems, verify that the precision level of the 95% CI is higher than
that of the 99% CI. What would you do to improve the precision level of the 99% CI?
4. Plot n vs. E (absolute error) for the 95% CI and the 99% CI on paper.
5. Plot n vs. E (absolute error) for the 95% CI with σ = 2 and σ = 4 on paper.
6. (py) Plot n vs. E (absolute error) for 95% CI and 99% CI using python. Assume σ = 2.
7. (py) Plot n vs. E (absolute error) for 95% CI with σ = 2 and σ = 4 using python.
8. Based on the plots, what would you conclude?