Statistics: Bikash T. Magar
Variation
• In statistics, measures of central tendency alone are not enough to describe a data set accurately. To measure the variability of a data set, three measures are used: the range, the variance, and the standard deviation.
• Karl Pearson in 1892 and 1893 introduced the statistical concepts of the range
and standard deviation.
• Range
- The range is the highest value minus the lowest value.
One extremely high or extremely low data value can affect the range markedly. So, to have a more meaningful statistic for measuring variability, we use the variance and standard deviation.
• Variance and Standard Deviation
-Data variation is based on how far each data value is from the mean. This difference is called a deviation.
The population variance (σ²) is the average of the squares of the distances of each value from the mean:
σ² = Σ(X − µ)² / N
where X = individual value, µ = population mean, and N = population size.
The population standard deviation is the square root of the variance:
σ = √( Σ(X − µ)² / N )
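As an illustrative sketch (the data values below are invented), these definitions translate directly into Python:

```python
import math

# Hypothetical population of six values
data = [10, 60, 50, 30, 40, 20]
N = len(data)
mu = sum(data) / N                               # population mean

# Population variance: mean of the squared deviations from mu
variance = sum((x - mu) ** 2 for x in data) / N
# Population standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(f"mu = {mu}, sigma^2 = {variance:.2f}, sigma = {std_dev:.2f}")
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same quantities.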
We sum the squares of the deviations because Σ(X − µ) = 0 is always the case. When the means of two different sets of data are equal, the larger the variance or standard deviation, the more variable the data are.
• The purpose of calculating a statistic is to estimate the corresponding parameter. When computing the variance for a sample, the sample mean X̄ is used to estimate the population mean µ. But
Σ(X − X̄)² / n
does not give the best estimate of the population variance, because when the population is large and the sample is small, the variance computed by this formula usually underestimates the population variance. Therefore, the sample variance is
s² = Σ(X − X̄)² / (n − 1)
where X = individual value, X̄ = sample mean, and n = sample size.
The sample standard deviation is
s = √( Σ(X − X̄)² / (n − 1) )
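A minimal sketch of the sample formulas, using the standard library as a cross-check (the sample values are invented):

```python
import math
import statistics

sample = [10, 60, 50, 30, 40, 20]           # hypothetical sample values
n = len(sample)
xbar = sum(sample) / n                      # sample mean

# Sample variance divides by n - 1, not n, to avoid underestimating
# the population variance.
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s2)

# statistics.variance / statistics.stdev use the same n - 1 divisor.
assert math.isclose(s2, statistics.variance(sample))
assert math.isclose(s, statistics.stdev(sample))
print(f"s^2 = {s2:.1f}, s = {s:.2f}")
```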
• The shortcut method for computing the sample standard deviation for data obtained from samples is as follows:
s = √( [nΣX² − (ΣX)²] / [n(n − 1)] )
The shortcut method is mathematically equivalent to the definitional method and does not involve using the mean. It is also more accurate when the mean has been rounded.
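The equivalence can be checked numerically; the following sketch (with made-up sample values) computes s both ways:

```python
import math

sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # hypothetical sample
n = len(sample)

# Definitional formula: squared deviations from the mean
xbar = sum(sample) / n
s_def = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

# Shortcut formula: uses only sum(X) and sum(X^2), never the mean
sum_x = sum(sample)
sum_x2 = sum(x * x for x in sample)
s_short = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

assert math.isclose(s_def, s_short)
print(f"s = {s_short:.3f}")
```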
• Computational method for grouped data
Sample variance:
s² = [nΣ(f·Xm²) − (Σf·Xm)²] / [n(n − 1)]
Sample standard deviation:
s = √( [nΣ(f·Xm²) − (Σf·Xm)²] / [n(n − 1)] )
where Xm is the midpoint of each class and f is the frequency of each class.
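A sketch of the grouped-data computation, using a hypothetical frequency distribution (class midpoints and frequencies are invented):

```python
import math

# Hypothetical grouped frequency distribution:
# class midpoints (Xm) and their frequencies (f)
midpoints = [7, 12, 17, 22, 27, 32]
frequencies = [1, 2, 3, 5, 4, 3]

n = sum(frequencies)                                        # total frequency
sum_fx = sum(f * xm for f, xm in zip(frequencies, midpoints))
sum_fx2 = sum(f * xm ** 2 for f, xm in zip(frequencies, midpoints))

# Grouped-data shortcut formulas
s2 = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = math.sqrt(s2)
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```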
• The coefficient of variation is the standard deviation divided by the mean, with the result expressed as a percentage.
For samples:
CVar = (s / X̄) · 100%
For populations:
CVar = (σ / µ) · 100%
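Since the coefficient of variation is unitless, it lets us compare the variability of data sets measured in different units. A sketch with invented sample data:

```python
import statistics

# Hypothetical samples in different units
sales = [12, 15, 11, 19, 13]                   # thousands of units
commissions = [1200, 2250, 1800, 2100, 1500]   # dollars

def cvar(sample):
    """Sample coefficient of variation as a percentage: (s / xbar) * 100."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

print(f"sales CVar = {cvar(sales):.1f}%, "
      f"commissions CVar = {cvar(commissions):.1f}%")
```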
• The Range Rule of Thumb
A rough estimate of the standard deviation is s ≈ range/4. This rule is only an approximation and should be used only when the distribution of data values is unimodal and roughly symmetric.
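A quick sketch comparing the rule-of-thumb estimate with the exact sample standard deviation (the data set is invented and roughly symmetric):

```python
import statistics

# Hypothetical, roughly symmetric unimodal sample
data = [22, 25, 26, 28, 29, 30, 31, 33, 34, 37]

estimate = (max(data) - min(data)) / 4      # range rule of thumb
actual = statistics.stdev(data)             # exact sample standard deviation
print(f"estimate = {estimate:.2f}, actual = {actual:.2f}")
```

The two values are not expected to match exactly; the rule only gives a ballpark figure.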
• Chebyshev’s theorem
-’The proportion of values from a data set that will fall within k standard deviations of the mean will be at least 1 − 1/k², where k is a number greater than 1 (k is not necessarily an integer).’
The theorem specifies the proportion of the spread in terms of the standard deviation. For example, suppose a data set has a mean of 60 and a standard deviation of 2. Using the expression with k = 2:
1 − 1/k² = 1 − 1/2² = 3/4 = 75%
At least 75% of the data values fall between 56 and 64.
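The guarantee can be checked on any data set, whatever its shape; a sketch with invented values:

```python
import statistics

# Hypothetical data set; Chebyshev's theorem guarantees at least
# 1 - 1/k^2 of the values lie within k standard deviations of the mean,
# regardless of the distribution's shape.
data = [48, 52, 55, 57, 58, 60, 60, 61, 63, 65, 68, 73]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)

k = 2
lower, upper = mu - k * sigma, mu + k * sigma
inside = sum(lower <= x <= upper for x in data)
proportion = inside / len(data)

assert proportion >= 1 - 1 / k ** 2        # at least 75% for k = 2
print(f"{proportion:.0%} of values lie within {k} standard deviations")
```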
• The Empirical (Normal) Rule
-Chebyshev’s theorem applies to any distribution regardless of its shape. How-
ever, when a distribution is bell-shaped(Normal), the following statements are
true:
1. Approximately 68% of the data values will fall within 1 standard deviation of the mean.
2. Approximately 95% of the data values will fall within 2 standard deviations
of the mean.
3. Approximately 99.7% of the data values will fall within 3 standard devia-
tions of the mean.
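The empirical rule can be illustrated with simulated bell-shaped data (the parameters 100 and 15 below are arbitrary; the seed makes the run reproducible):

```python
import random
import statistics

# Simulated normal data; roughly 68% should fall within 1 standard
# deviation of the mean and roughly 95% within 2.
random.seed(0)
data = [random.gauss(100, 15) for _ in range(10_000)]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)

within_1sd = sum(mu - sigma <= x <= mu + sigma for x in data) / len(data)
within_2sd = sum(mu - 2 * sigma <= x <= mu + 2 * sigma for x in data) / len(data)
print(f"within 1 sd: {within_1sd:.1%}, within 2 sd: {within_2sd:.1%}")
```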
Skewness
• A measure of skewness describes the degree and direction of asymmetry, or departure from symmetry, of a distribution. A value of 0 indicates that the distribution is symmetric and Mean = Median = Mode. The farther the measure is from 0, the greater the degree of asymmetry. The sign of the measure indicates the direction of skewness.
• A distribution that is asymmetric is said to be skewed.
1. Positively skewed
-The tail of the curve of the distribution elongates more on the right. Also,
M ean > M edian > M ode
2. Negatively skewed
-The tail of the curve of the distribution elongates more on the left. Also,
M ean < M edian < M ode
• Measure of Skewness
1. Pearson’s first skewness coefficient (mode skewness) is
Sk = (X̄ − M₀) / σ
2. Pearson’s second skewness coefficient (median skewness) is
Sk = 3(X̄ − Md) / σ
The second measure is used when the distribution is bimodal.
3. Bowley’s measure of skewness, also called Yule’s coefficient, is a quantile-based measure defined as
B = [(Q3 + Q1)/2 − Q2] / [(Q3 − Q1)/2] = (Q3 − 2Q2 + Q1) / (Q3 − Q1)
The numerator is the difference between the average of the upper and lower quartiles and the median, while the denominator is the semi-interquartile range.
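A sketch computing the median and quartile skewness measures on an invented right-skewed sample (both should come out positive):

```python
import statistics

# Hypothetical right-skewed sample
data = [1, 2, 2, 3, 3, 3, 4, 5, 7, 10]

xbar = statistics.mean(data)
md = statistics.median(data)
s = statistics.stdev(data)

# Pearson's second (median) skewness coefficient
pearson_sk = 3 * (xbar - md) / s

# Bowley's (quartile) skewness coefficient
q1, q2, q3 = statistics.quantiles(data, n=4)   # the three quartiles
bowley = (q3 - 2 * q2 + q1) / (q3 - q1)

print(f"Pearson Sk = {pearson_sk:.3f}, Bowley B = {bowley:.3f}")
```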
Correlation
• Correlation coefficient
- Statisticians use a measure called the correlation coefficient to determine the strength of the linear relationship between two variables. Correlation may be positive, negative, or zero (no correlation).
- Pearson’s product moment correlation coefficient is one of the linear correlation coefficients. Its value always lies between −1 and +1 inclusive, i.e., −1 ≤ r ≤ 1.
Figure 4: Relationship between the correlation coefficient and the scatter plot
• The deviation method is one of the methods to calculate the value of Pearson’s product moment correlation coefficient:
r = [nΣxy − (Σx)(Σy)] / √( [nΣx² − (Σx)²] [nΣy² − (Σy)²] )
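A sketch of the formula applied to invented paired data:

```python
import math

# Hypothetical paired data (x, y)
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 8]
n = len(x)

# The five sums the formula needs
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(f"r = {r:.3f}")
```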
• Best fit means that the sum of the squares of the vertical distances from each
point to the line is at a minimum.
• The difference between the actual value y and the predicted value y′ is called a residual or a predicted error. The method used for making the sum of the squares of the residuals as small as possible is called the method of least squares. As a result of this method, the regression line is also called the least squares regression line.
• The reason for finding a line of best fit is that the values of y will be predicted from the values of x; hence, the closer the points are to the line, the better the fit and the prediction will be. The regression line has the equation
y′ = a + bx
where
a = [(Σy)(Σx²) − (Σx)(Σxy)] / [nΣx² − (Σx)²]
b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]
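A sketch computing a and b from the same sums used for the correlation coefficient (the paired data are invented):

```python
# Hypothetical paired data (x, y)
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 8]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

# Both coefficients share the same denominator
denom = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom   # intercept
b = (n * sum_xy - sum_x * sum_y) / denom        # slope

y7 = a + b * 7                                  # predict y' at x = 7
print(f"y' = {a:.3f} + {b:.3f}x; y'(7) = {y7:.2f}")
```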
Figure 7: A Line as Represented in Algebra and in Statistics