Analysis of Statistical Data
Central Tendency (Center) and Dispersion (Variability)
• Central tendency: measures of the degree to which scores are clustered around the mean of a distribution
• Dispersion: measures of the fluctuations (variability) around the characteristics of central tendency
Measures of Center
• A measure along the horizontal axis of
the data distribution that locates the
center of the distribution.
Arithmetic Mean or Average
• The mean of a set of measurements is
the sum of the measurements divided
by the total number of measurements.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where n = number of measurements and $\sum x_i$ = sum of all the measurements
Example
• The set: 2, 9, 11, 5, 6
$$\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6$$
If we were able to enumerate the whole population, the population mean would be called µ (the Greek letter “mu”).
Example:
• Resistance of 5 coils: 3.35, 3.37, 3.28, 3.34, 3.30 ohm.
• The average:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{3.35 + 3.37 + 3.28 + 3.34 + 3.30}{5} = \frac{16.64}{5} \approx 3.33$$
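The two worked examples above can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and does not come from the slides.

```python
def arithmetic_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


resistances = [3.35, 3.37, 3.28, 3.34, 3.30]  # coil resistances in ohm
print(round(arithmetic_mean(resistances), 2))  # 3.33
print(arithmetic_mean([2, 9, 11, 5, 6]))       # 6.6
```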
Weighted Mean
• The weighted mean of the positive real numbers x1, x2, ..., xn with weights w1, w2, ..., wn is defined to be
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
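As an illustration of the weighted mean formula, the sketch below uses hypothetical exam scores weighted by credit hours. The data and the name `weighted_mean` are assumptions made for this example only.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: three exam scores and the credit hours of each course.
scores = [80, 90, 70]
credits = [3, 4, 3]
print(weighted_mean(scores, credits))  # 81.0
```

With equal weights the result reduces to the ordinary arithmetic mean.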
Geometric Mean
• The geometric mean is defined as the positive n-th root of the product of the n observations. Symbolically,
$$GM = (x_1 x_2 x_3 \cdots x_n)^{1/n}$$
• It is often used for a set of numbers whose values are exponential in nature, such as data on the growth of the human population or interest rates of a financial investment.
• Exercise: Find the geometric mean of the growth rates 34, 27, 45, 55, 22, 34.
Harmonic Mean
• The harmonic mean is the number of observations divided by the sum of the reciprocals of the observations.
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
• Useful for averaging ratios such as speed (= distance/time).
• Exercise: Find the harmonic mean of 1, 2, and 4.

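The geometric and harmonic means defined above can be sketched in Python as follows, which also gives a way to check the two exercises. The function names are illustrative. Averaging logarithms instead of multiplying the values directly is an implementation choice made here to avoid overflow for long data sets; it is mathematically equivalent to the n-th root of the product.

```python
import math


def geometric_mean(values):
    """Positive n-th root of the product of n positive observations."""
    if not values or any(x <= 0 for x in values):
        raise ValueError("the geometric mean needs positive observations")
    # exp(mean of logs) equals (x1 * x2 * ... * xn) ** (1 / n)
    return math.exp(sum(math.log(x) for x in values) / len(values))


def harmonic_mean(values):
    """Number of observations divided by the sum of their reciprocals."""
    if not values or any(x <= 0 for x in values):
        raise ValueError("the harmonic mean needs positive observations")
    return len(values) / sum(1 / x for x in values)


growth_rates = [34, 27, 45, 55, 22, 34]
print(round(geometric_mean(growth_rates), 2))
print(round(harmonic_mean([1, 2, 4]), 3))  # 1.714
```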
Median
• The median of a set of measurements
is the middle measurement when the
measurements are ranked from
smallest to largest.
• The position of the median is 0.5(n + 1) once the measurements have been ordered.
Example
• The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
• Sort: 2, 3, 4, 5, 6, 8, 9
• Position: 0.5(n + 1) = 0.5(7 + 1) = 4th
• Median = 4th ordered measurement = 5
• The set: 2, 4, 9, 8, 6, 5 (n = 6)
• Sort: 2, 4, 5, 6, 8, 9
• Position: 0.5(n + 1) = 0.5(6 + 1) = 3.5th
• Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th measurements
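The position rule 0.5(n + 1) can be written as a short function. This is a sketch with an illustrative name: for odd n it picks the middle ordered value, and for even n it averages the two values on either side of the half position.

```python
def median(values):
    """Middle measurement of the ordered data (position 0.5 * (n + 1))."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # position (n + 1) / 2, counted from 1
    # Even n: average the measurements at positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 9, 8, 6, 5, 3]))  # 5
print(median([2, 4, 9, 8, 6, 5]))     # 5.5
```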
Mode
• The mode is the measurement which occurs
most frequently.
• The set: 2, 4, 9, 8, 8, 5, 3
• The mode is 8, which occurs twice
• The set: 2, 2, 9, 8, 8, 5, 3
• There are two modes—8 and 2 (bimodal)
• The set: 2, 4, 9, 8, 5, 3
• There is no mode (each value is unique).
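The three cases above (one mode, two modes, no mode) can be reproduced with a small helper. This is a sketch: the name `modes` is illustrative, and returning an empty list when every value is unique follows the convention used in these slides, not a universal rule.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or [] when each value is unique."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # slide convention: no mode when nothing repeats
    return sorted(value for value, count in counts.items() if count == top)


print(modes([2, 4, 9, 8, 8, 5, 3]))  # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))  # [2, 8]
print(modes([2, 4, 9, 8, 5, 3]))     # []
```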
Example
The number of quarts of milk purchased by 25 households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5
• Mean? $\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$
• Median? m = 2
• Mode? (Highest peak) mode = 2
[Figure: relative frequency histogram; horizontal axis: Quarts (0 to 5); vertical axis: Relative frequency (0 to 10/25)]
Extreme Values
• The mean is more easily affected by extremely large or small values than the median.
• The median is often used as a measure of center when the distribution is skewed.
• Symmetric: Mean = Median
• Skewed right: Mean > Median
• Skewed left: Mean < Median

Measures of Variability
• A measure along the horizontal axis of the data distribution that describes the spread of the distribution from the center.
• Range: difference between the maximum and minimum values
• Interquartile range: difference between the third and first quartiles (Q3 – Q1)
• Variance: average of the squared deviations from the mean
• Standard deviation: square root of the variance
[Figure: example distributions labeled "Variability", "Variability", and "No Variability"]
The Range
• The range, R, of a set of n measurements is the
difference between the largest and smallest
measurements.
• Example: A botanist records the number of
petals on 5 flowers:
5, 12, 6, 8, 14
• The range is R = 14 – 5 = 9.
Quartiles
• The quartiles Q1, Q2, and Q3 divide the ordered measurements into four parts, each containing 25% of the data.
Percentile
• 50th percentile ≡ Median (Q2)
• 25th percentile ≡ Lower quartile (Q1)
• 75th percentile ≡ Upper quartile (Q3)
Interquartile Range:
IQR = Q3 – Q1
• The position of the p-th percentile is (p/100)(n + 1)
• The position of Q1 is 0.25(n + 1)
• The position of Q3 is 0.75(n + 1)
once the measurements have been ordered. If the positions are not integers, find the quartiles by interpolation.
Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = 0.25(18 + 1) = 4.75
Position of Q3 = 0.75(18 + 1) = 14.25
Q1 is 3/4 of the way between the 4th and 5th ordered measurements, or Q1 = 65 + 0.75(65 – 65) = 65.
Q3 is 1/4 of the way between the 14th and 15th ordered measurements, or Q3 = 74 + 0.25(75 – 74) = 74.25,
and
IQR = Q3 – Q1 = 74.25 – 65 = 9.25
90th percentile P90
• The position of the 90th percentile is 0.9(18 + 1) = 17.1
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
P90 is 1/10 of the way between the 17th and 18th ordered measurements: P90 = 90 + 0.10(95 – 90) = 90.5
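The position-and-interpolation procedure used for Q1, Q3, and P90 can be sketched as one function. The name `percentile` is illustrative. Clamping to the smallest or largest measurement when the position falls outside 1..n is an assumption added here; the slides do not cover that case.

```python
def percentile(values, p):
    """p-th percentile at position (p / 100) * (n + 1), linearly interpolated."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("a percentile of an empty data set is undefined")
    position = p / 100 * (n + 1)   # 1-based position in the ordered data
    lower = int(position)          # integer part of the position
    fraction = position - lower    # how far to move toward the next value
    if lower < 1:                  # assumption: clamp below the first value
        return ordered[0]
    if lower >= n:                 # assumption: clamp above the last value
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


prices = [40, 60, 65, 65, 65, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]
q1 = percentile(prices, 25)
q3 = percentile(prices, 75)
print(q1, q3, q3 - q1)                   # 65.0 74.25 9.25
print(round(percentile(prices, 90), 2))  # 90.5
```

Other textbooks and software packages use different position rules, so quartiles from another tool may differ slightly from these values.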

The Variance
• The variance is a measure of variability that uses all the measurements. It measures the average squared deviation of the measurements about their mean.
• Flower petals: 5, 12, 6, 8, 14
$$\bar{x} = \frac{45}{5} = 9$$
[Figure: petal counts plotted on a horizontal axis from 4 to 14]
The Variance
• The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean µ.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
• The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n – 1).
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
The Standard Deviation
• In calculating the variance, we squared all of
the deviations, and in doing so changed the
scale of the measurements.
• To return this measure of variability to the
original units of measure, we calculate the
standard deviation, the positive square root of
the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Two Ways to Calculate the Sample Variance
Use the Definition Formula:
  x_i    x_i − x̄    (x_i − x̄)²
    5       -4          16
   12        3           9
    6       -3           9
    8       -1           1
   14        5          25
  Sum 45     0          60
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15$$
$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
Two Ways to Calculate the Sample Variance
Use the calculation formula:
  x_i    x_i²
    5      25
   12     144
    6      36
    8      64
   14     196
  Sum 45  465
$$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{465 - \frac{45^2}{5}}{4} = 15$$
$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
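Both routes to the sample variance can be compared in code. This is a minimal sketch with illustrative function names, using the flower petal data from the tables.

```python
import math


def sample_variance_definition(values):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two measurements")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Calculation formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two measurements")
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)


petals = [5, 12, 6, 8, 14]
print(sample_variance_definition(petals))  # 15.0
print(sample_variance_shortcut(petals))    # 15.0
print(round(math.sqrt(sample_variance_definition(petals)), 2))  # 3.87
```

The calculation formula is convenient by hand, but in floating-point arithmetic it subtracts two large, nearly equal numbers, so the definition formula is usually the safer choice in code.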
Example: ungrouped data
• Sample: the moisture content (%) of kraft paper is 6.7, 6.0, 6.4, 6.4, 5.9, and 5.8.
$$s = \sqrt{\frac{231.26 - \frac{(37.2)^2}{6}}{6 - 1}} = 0.35$$
• Sample standard deviation, s = 0.35
Using Measures of Center and Spread:
The Empirical Rule
Given a distribution of measurements
that is approximately mound-shaped:
The interval µ ± σ contains approximately 68% of the
measurements.
The interval µ ± 2σ contains approximately 95% of
the measurements.
The interval µ ± 3σ contains approximately 99.7% of
the measurements.
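The three percentages can be reproduced for the idealized case of an exactly normal distribution, which is an assumption made for this sketch; real mound-shaped data will only match approximately. It uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

# Assumption: a standard normal distribution stands in for "mound-shaped" data.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```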
The Empirical Rule: An Example
Measures of Relative Standing
• Where does one particular measurement stand in
relation to the other measurements in the data
set?
• How many standard deviations away from the
mean does the measurement lie? This is measured
by the z-score.
Suppose s = 2, $\bar{x} = 5$, and x = 9.
$$z\text{-score} = \frac{x - \bar{x}}{s} = \frac{9 - 5}{2} = 2$$
x = 9 lies z = 2 standard deviations from the mean.
z-Scores
• z-scores between –2 and 2 are not unusual. z-scores
should not be more than 3 in absolute value. z-scores
larger than 3 in absolute value would indicate a
possible outlier.
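[Figure: z-score number line from -3 to 3. Between -2 and 2: not unusual; between 2 and 3 in absolute value: somewhat unusual; beyond 3 in absolute value: outlier]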
Example of z-Scores
   X    z-Score         X    z-Score
  10   -1.28244        10   -0.29204
  15    0.625954      500    3.473714
  10   -1.28244        10   -0.29204
  16    1.007634       16   -0.24593
  11   -0.90076        11   -0.28435
  17    1.389313       17   -0.23824
  14    0.244275       14   -0.2613
  13   -0.1374         13   -0.26898
  10   -1.28244        10   -0.29204
  16    1.007634       16   -0.24593
  11   -0.90076        11   -0.28435
  17    1.389313       17   -0.23824
  14    0.244275       14   -0.2613
  13   -0.1374         13   -0.26898
Coefficient of Variation (CV)
• When comparing data sets with different units or widely different means, one should use the coefficient of variation for comparison instead of the standard deviation.
• The coefficient of variation can be written as
$$CV = \frac{s}{\bar{x}}$$
• We express CV as a percentage by multiplying by 100.
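A short sketch of the coefficient of variation as a percentage, applied to the flower petal data used earlier. The function name is illustrative, and the sample standard deviation (divisor n – 1) is assumed for s.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    m = mean(values)
    if m == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return stdev(values) / m * 100


petals = [5, 12, 6, 8, 14]
print(round(coefficient_of_variation(petals), 1))  # 43.0

# Rescaling the data (a change of units) leaves the CV unchanged.
rescaled = [10 * x for x in petals]
print(round(coefficient_of_variation(rescaled), 1))  # 43.0
```

Because the units cancel in the ratio, the CV of two data sets can be compared even when they are measured on different scales.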
Skewness
• Skewness measures the degree of asymmetry exhibited by the data.
• The data can exhibit positive or negative skewness.
• If the mean of the data is greater than its median, the data is positively skewed; if the mean is less than its median, the data is negatively skewed.
• Mathematically,
$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
Skewness
[Figure: three distribution shapes, each marking the mean, median, and mode: Negatively Skewed, Symmetric (Not Skewed), Positively Skewed]
Kurtosis
• Kurtosis measures the peakedness of the data relative to the normal distribution.
• Data with a high degree of peakedness is said to be leptokurtic and has a kurtosis value greater than 3.
• Flat data has a kurtosis value of less than 3 and is called platykurtic.
• Mathematically,
$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4}$$
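The skewness and kurtosis formulas share one pattern: a sum of powered deviations divided by n times the same power of s. The sketch below implements that pattern once. The function names are illustrative, and it assumes s is the sample standard deviation (divisor n – 1) as defined earlier in these notes; statistical software often uses slightly different conventions, so results may not match other tools exactly.

```python
from statistics import mean, stdev


def standardized_moment(values, order):
    """Sum of (x - mean)^order divided by n * s^order."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("undefined when all values are equal")
    return sum((x - m) ** order for x in values) / (n * s ** order)


def skewness(values):
    return standardized_moment(values, 3)


def kurtosis(values):
    return standardized_moment(values, 4)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical data with a long right tail

print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
print(kurtosis(symmetric) < 3)     # True: evenly spread data is platykurtic
```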
Kurtosis
• Peakedness of a distribution
• Leptokurtic: high and thin
• Mesokurtic: normal in shape
• Platykurtic: flat and spread out
[Figure: leptokurtic, mesokurtic, and platykurtic curves]
Skewness and Kurtosis