Statistics: Bikash T. Magar
Variation
• In statistics, measures of central tendency alone are not enough to describe a data set accurately. To measure the variability of a data set, three measures are used: the range, the variance, and the standard deviation.
• Karl Pearson in 1892 and 1893 introduced the statistical concepts of the range
and standard deviation.
• Range
- The range is the highest value minus the lowest value.
One extremely high or extremely low data value can affect the range markedly. So, to have a more meaningful statistic for measuring variability, we use the variance and standard deviation.
• Variance and Standard Deviation
-Data variation is based on how far each data value is from the mean. This difference is called a deviation.
The population variance (σ²) is the average of the squares of the distances of each value from the mean:
σ² = Σ(X − µ)² / N
where X = individual value, µ = population mean, and N = population size.
The population standard deviation is the square root of the variance:
σ = √( Σ(X − µ)² / N )
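As an illustrative sketch (the data values below are invented), these definitions translate directly into Python:

```python
import math

# Hypothetical population of six values
data = [10, 60, 50, 30, 40, 20]
N = len(data)
mu = sum(data) / N                               # population mean

# Population variance: mean of the squared deviations from mu
variance = sum((x - mu) ** 2 for x in data) / N
# Population standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(f"mu = {mu}, sigma^2 = {variance:.2f}, sigma = {std_dev:.2f}")
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same quantities.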
We sum the squares of the deviations because Σ(X − µ) = 0 is always the case. When the means of two different sets of data are equal, the larger the variance or standard deviation, the more variable the data are.
• The purpose of calculating a statistic is to estimate the corresponding parameter. When computing the variance for a sample, the sample mean X̄ is used to estimate the population mean µ. But
Σ(X − X̄)² / n
does not give the best estimate of the population variance, because when the population is large and the sample is small, the variance computed by this formula usually underestimates the population variance. Therefore, the sample variance is
s² = Σ(X − X̄)² / (n − 1)
where X = individual value, X̄ = sample mean, and n = sample size.
The sample standard deviation is
s = √( Σ(X − X̄)² / (n − 1) )
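A minimal sketch of the sample formulas, using the standard library as a cross-check (the sample values are invented):

```python
import math
import statistics

sample = [10, 60, 50, 30, 40, 20]           # hypothetical sample values
n = len(sample)
xbar = sum(sample) / n                      # sample mean

# Sample variance divides by n - 1, not n, to avoid underestimating
# the population variance.
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s2)

# statistics.variance / statistics.stdev use the same n - 1 divisor.
assert math.isclose(s2, statistics.variance(sample))
assert math.isclose(s, statistics.stdev(sample))
print(f"s^2 = {s2:.1f}, s = {s:.2f}")
```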
• The shortcut method for computing the sample standard deviation for data obtained from samples is as follows:
s = √( [nΣX² − (ΣX)²] / [n(n − 1)] )
The shortcut method is mathematically equivalent to the definitional method and does not involve using the mean. It is also more accurate when the mean has been rounded.
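The equivalence can be checked numerically; the following sketch (with made-up sample values) computes s both ways:

```python
import math

sample = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # hypothetical sample
n = len(sample)

# Definitional formula: squared deviations from the mean
xbar = sum(sample) / n
s_def = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

# Shortcut formula: uses only sum(X) and sum(X^2), never the mean
sum_x = sum(sample)
sum_x2 = sum(x * x for x in sample)
s_short = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

assert math.isclose(s_def, s_short)
print(f"s = {s_short:.3f}")
```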
• Computational method for grouped data
Sample variance:
s² = [nΣ(f·Xm²) − (Σf·Xm)²] / [n(n − 1)]
Sample standard deviation:
s = √( [nΣ(f·Xm²) − (Σf·Xm)²] / [n(n − 1)] )
where Xm is the midpoint of each class and f is the frequency of each class.
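A sketch of the grouped-data computation, using a hypothetical frequency distribution (class midpoints and frequencies are invented):

```python
import math

# Hypothetical grouped frequency distribution:
# class midpoints (Xm) and their frequencies (f)
midpoints = [7, 12, 17, 22, 27, 32]
frequencies = [1, 2, 3, 5, 4, 3]

n = sum(frequencies)                                        # total frequency
sum_fx = sum(f * xm for f, xm in zip(frequencies, midpoints))
sum_fx2 = sum(f * xm ** 2 for f, xm in zip(frequencies, midpoints))

# Grouped-data shortcut formulas
s2 = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = math.sqrt(s2)
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```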
• The coefficient of variation is the standard deviation divided by the mean, with the result expressed as a percentage.
For samples:
CVar = (s / X̄) · 100%
For populations:
CVar = (σ / µ) · 100%
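Since the coefficient of variation is unitless, it lets us compare the variability of data sets measured in different units. A sketch with invented sample data:

```python
import statistics

# Hypothetical samples in different units
sales = [12, 15, 11, 19, 13]                   # thousands of units
commissions = [1200, 2250, 1800, 2100, 1500]   # dollars

def cvar(sample):
    """Sample coefficient of variation as a percentage: (s / xbar) * 100."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

print(f"sales CVar = {cvar(sales):.1f}%, "
      f"commissions CVar = {cvar(commissions):.1f}%")
```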
• The Range Rule of Thumb
A rough estimate of the standard deviation is s ≈ range/4. This rule is only an approximation and should be used only when the distribution of data values is unimodal and roughly symmetric.
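A quick sketch comparing the rule-of-thumb estimate with the exact sample standard deviation (the data set is invented and roughly symmetric):

```python
import statistics

# Hypothetical, roughly symmetric unimodal sample
data = [22, 25, 26, 28, 29, 30, 31, 33, 34, 37]

estimate = (max(data) - min(data)) / 4      # range rule of thumb
actual = statistics.stdev(data)             # exact sample standard deviation
print(f"estimate = {estimate:.2f}, actual = {actual:.2f}")
```

The two values are not expected to match exactly; the rule only gives a ballpark figure.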
• Chebyshev’s theorem
-’The proportion of values from a data set that will fall within k standard deviations of the mean will be at least 1 − 1/k², where k is a number greater than 1 (k is not necessarily an integer).’
The theorem specifies the proportion of the spread in terms of the standard deviation. For example, suppose a data set has a mean of 60 and a standard deviation of 2. Using the expression with k = 2:
1 − 1/k² = 1 − 1/2² = 3/4 = 75%
At least 75% of the data values fall between 56 and 64.
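The guarantee can be checked on any data set, whatever its shape; a sketch with invented values:

```python
import statistics

# Hypothetical data set; Chebyshev's theorem guarantees at least
# 1 - 1/k^2 of the values lie within k standard deviations of the mean,
# regardless of the distribution's shape.
data = [48, 52, 55, 57, 58, 60, 60, 61, 63, 65, 68, 73]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)

k = 2
lower, upper = mu - k * sigma, mu + k * sigma
inside = sum(lower <= x <= upper for x in data)
proportion = inside / len(data)

assert proportion >= 1 - 1 / k ** 2        # at least 75% for k = 2
print(f"{proportion:.0%} of values lie within {k} standard deviations")
```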
• The Empirical (Normal) Rule
-Chebyshev’s theorem applies to any distribution regardless of its shape. How-
ever, when a distribution is bell-shaped(Normal), the following statements are
true:
1. Approximately 68% of the data values will fall within 1 standard deviation of the mean.
2. Approximately 95% of the data values will fall within 2 standard deviations
of the mean.
3. Approximately 99.7% of the data values will fall within 3 standard devia-
tions of the mean.
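The empirical rule can be illustrated with simulated bell-shaped data (the parameters 100 and 15 below are arbitrary; the seed makes the run reproducible):

```python
import random
import statistics

# Simulated normal data; roughly 68% should fall within 1 standard
# deviation of the mean and roughly 95% within 2.
random.seed(0)
data = [random.gauss(100, 15) for _ in range(10_000)]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)

within_1sd = sum(mu - sigma <= x <= mu + sigma for x in data) / len(data)
within_2sd = sum(mu - 2 * sigma <= x <= mu + 2 * sigma for x in data) / len(data)
print(f"within 1 sd: {within_1sd:.1%}, within 2 sd: {within_2sd:.1%}")
```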
Skewness
• A measure of skewness describes the degree and direction of asymmetry, or departure from symmetry, of a distribution. A value of 0 indicates that the distribution is symmetric and Mean = Median = Mode. The farther the measure is from 0, the greater the degree of asymmetry. The sign of the measure indicates the direction of skewness.
• A distribution that is asymmetric is said to be skewed.
1. Positively skewed
-The tail of the curve of the distribution elongates more on the right. Also,
M ean > M edian > M ode
2. Negatively skewed
-The tail of the curve of the distribution elongates more on the left. Also,
M ean < M edian < M ode
• Measure of Skewness
1. Pearson’s first skewness coefficient (mode skewness) is
Sk = (X̄ − M₀) / σ
2. Pearson’s second skewness coefficient (median skewness) is
Sk = 3(X̄ − Md) / σ
The second measure is used when the distribution is bimodal.
3. Bowley’s measure of skewness, also called Yule’s coefficient, is a quantile-based measure defined as
B = [(Q3 + Q1)/2 − Q2] / [(Q3 − Q1)/2] = (Q3 − 2Q2 + Q1) / (Q3 − Q1)
The numerator is the difference between the average of the upper and lower quartiles and the median, while the denominator is the semi-interquartile range.
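A sketch computing the median and quartile skewness measures on an invented right-skewed sample (both should come out positive):

```python
import statistics

# Hypothetical right-skewed sample
data = [1, 2, 2, 3, 3, 3, 4, 5, 7, 10]

xbar = statistics.mean(data)
md = statistics.median(data)
s = statistics.stdev(data)

# Pearson's second (median) skewness coefficient
pearson_sk = 3 * (xbar - md) / s

# Bowley's (quartile) skewness coefficient
q1, q2, q3 = statistics.quantiles(data, n=4)   # the three quartiles
bowley = (q3 - 2 * q2 + q1) / (q3 - q1)

print(f"Pearson Sk = {pearson_sk:.3f}, Bowley B = {bowley:.3f}")
```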
Correlation
• Correlation coefficient
- Statisticians use a measure called the correlation coefficient to determine the strength of the linear relationship between two variables. Correlation may be positive, negative, or zero (no correlation).
- Pearson’s product moment correlation coefficient is one of the linear correlation coefficients. Its value always lies between −1 and +1 inclusive, i.e., −1 ≤ r ≤ 1.
Figure 4: Relationship between the correlation coefficient and the scatter plot
• The deviation method is one of the methods to calculate the value of Pearson’s product moment correlation coefficient:
r = [nΣxy − (Σx)(Σy)] / √( [nΣx² − (Σx)²] [nΣy² − (Σy)²] )
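A sketch of the formula applied to invented paired data:

```python
import math

# Hypothetical paired data (x, y)
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 8]
n = len(x)

# The five sums the formula needs
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(f"r = {r:.3f}")
```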
• Best fit means that the sum of the squares of the vertical distances from each
point to the line is at a minimum.
• The difference between the actual value y and the predicted value y′ is called a residual or a predicted error. The method used for making the sum of the squares of the residuals as small as possible is called the method of least squares. As a result of this method, the regression line is also called the least squares regression line.
• The reason for finding a line of best fit is that the values of y will be predicted from the values of x; hence, the closer the points are to the line, the better the fit and the prediction will be. The regression line has the equation
y′ = a + bx
where
a = [(Σy)(Σx²) − (Σx)(Σxy)] / [nΣx² − (Σx)²]
b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]
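A sketch computing a and b from the same sums used for the correlation coefficient (the paired data are invented):

```python
# Hypothetical paired data (x, y)
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 8]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

# Both coefficients share the same denominator
denom = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom   # intercept
b = (n * sum_xy - sum_x * sum_y) / denom        # slope

y7 = a + b * 7                                  # predict y' at x = 7
print(f"y' = {a:.3f} + {b:.3f}x; y'(7) = {y7:.2f}")
```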
Figure 7: A Line as Represented in Algebra and in Statistics