Analysis of Statistical Data


Central Tendency (Center) and Dispersion (Variability)
• Central tendency: measures of the value around which the scores of a distribution cluster (its center).
• Dispersion: measures of the fluctuation (variability) of the scores around the measures of central tendency.
Measures of Center
• A measure along the horizontal axis of
the data distribution that locates the
center of the distribution.
Arithmetic Mean or Average
• The mean of a set of measurements is
the sum of the measurements divided
by the total number of measurements.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where n = number of measurements and $\sum x_i$ = sum of all the measurements.
Example
• The set: 2, 9, 11, 5, 6

$$\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6$$

If we were able to enumerate the whole population, the population mean would be called µ (the Greek letter “mu”).
Example:
• Resistance of 5 coils: 3.35, 3.37, 3.28, 3.34, 3.30 ohm.
• The average:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{3.35 + 3.37 + 3.28 + 3.34 + 3.30}{5} \approx 3.33$$
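The coil example can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean

# Coil resistances (ohm) from the example above.
resistances = [3.35, 3.37, 3.28, 3.34, 3.30]

# Definition: sum of the measurements divided by the number of measurements.
mean_by_definition = sum(resistances) / len(resistances)

# The standard library computes the same quantity.
mean_by_library = mean(resistances)

print(round(mean_by_definition, 2))  # 3.33
```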
Weighted Mean
• The weighted mean of the positive real numbers x1, x2, ..., xn with weights w1, w2, ..., wn is defined to be

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
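A small sketch of the weighted mean formula follows. The scores and weights are made-up illustration values (assumptions, not data from the slides), and `weighted_mean` is a hypothetical helper name.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Hypothetical data: three scores with weights 2, 3, and 5.
scores = [80, 90, 70]
weights = [2, 3, 5]

result = weighted_mean(scores, weights)
print(result)  # (160 + 270 + 350) / 10 = 78.0
```

With equal weights the function reduces to the ordinary arithmetic mean.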
Geometric Mean
• The geometric mean is defined as the positive n-th root of the product of the n observations. Symbolically,

$$GM = (x_1 x_2 x_3 \cdots x_n)^{1/n}$$

• It is often used for a set of numbers whose values are exponential in nature, such as data on the growth of the human population or interest rates of a financial investment.

• Exercise: Find the geometric mean of the rates of growth: 34, 27, 45, 55, 22, 34
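One way to work the geometric mean exercise numerically is sketched below. It assumes Python 3.8 or later, where `math.prod` and `statistics.geometric_mean` are available; the variable names are illustrative.

```python
import math
from statistics import geometric_mean

# Rates of growth from the exercise above.
rates = [34, 27, 45, 55, 22, 34]

# Definition: n-th root of the product of the n observations.
gm_by_definition = math.prod(rates) ** (1 / len(rates))

# Standard-library equivalent (computed internally via logarithms).
gm_by_library = geometric_mean(rates)

print(gm_by_definition, gm_by_library)
```

The two results should agree up to floating-point rounding, and the geometric mean never exceeds the arithmetic mean of the same positive data.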
Harmonic Mean
• The harmonic mean is the number of observations divided by the sum of the reciprocals of the observations.

$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

• Useful for ratios such as speed (= distance/time).

• Exercise: Find the harmonic mean of 1, 2, and 4
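The harmonic mean exercise can be checked as follows. This is a minimal sketch with illustrative variable names; `statistics.harmonic_mean` is part of the Python standard library.

```python
from statistics import harmonic_mean

values = [1, 2, 4]

# Definition: n divided by the sum of the reciprocals.
# Here: 3 / (1 + 0.5 + 0.25) = 3 / 1.75 = 12/7.
hm_by_definition = len(values) / sum(1 / x for x in values)

# Standard-library equivalent.
hm_by_library = harmonic_mean(values)

print(round(hm_by_definition, 3))  # 1.714
```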


Median
• The median of a set of measurements
is the middle measurement when the
measurements are ranked from
smallest to largest.
• The position of the median is 0.5(n + 1) once the measurements have been ordered.
Example
• The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
• Sort: 2, 3, 4, 5, 6, 8, 9
• Position: 0.5(n + 1) = 0.5(7 + 1) = 4th
Median = 4th ordered measurement = 5

• The set: 2, 4, 9, 8, 6, 5 (n = 6)
• Sort: 2, 4, 5, 6, 8, 9
• Position: 0.5(n + 1) = 0.5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th measurements
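The position rule 0.5(n + 1) translates directly into code. The sketch below uses a hypothetical helper, `median_by_position`, that follows the two cases in the examples: a whole-number position picks one measurement, a half position averages its two neighbors.

```python
def median_by_position(values):
    """Median located at 1-based position 0.5 * (n + 1) in the sorted data."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    position = 0.5 * (n + 1)
    lower = int(position)  # whole part of the 1-based position
    if position == lower:
        # Odd n: the position is a whole number, take that measurement.
        return ordered[lower - 1]
    # Even n: the position ends in .5, average the two neighbors.
    return (ordered[lower - 1] + ordered[lower]) / 2

print(median_by_position([2, 4, 9, 8, 6, 5, 3]))  # 5
print(median_by_position([2, 4, 9, 8, 6, 5]))     # 5.5
```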
Mode
• The mode is the measurement which occurs
most frequently.
• The set: 2, 4, 9, 8, 8, 5, 3
• The mode is 8, which occurs twice
• The set: 2, 2, 9, 8, 8, 5, 3
• There are two modes—8 and 2 (bimodal)
• The set: 2, 4, 9, 8, 5, 3
• There is no mode (each value is unique).
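The three mode cases above can be reproduced with a frequency count. This is a sketch under the slides' convention that a set with all unique values has no mode; `modes` is a hypothetical helper, and note that the standard library's `statistics.multimode` would instead return every value in that case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent values, or [] when no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value is unique: no mode under the slides' convention.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 4, 9, 8, 8, 5, 3]))  # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))  # [2, 8]
print(modes([2, 4, 9, 8, 5, 3]))     # []
```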
Example
The number of quarts of milk purchased by 25
households:
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5
• Mean? $\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$
• Median? m = 2
• Mode? (Highest peak) mode = 2

[Figure: relative frequency histogram; x-axis: Quarts (0 to 5); y-axis: Relative frequency (0 to 10/25)]
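The milk example can be verified in a few lines, including the relative frequencies that the histogram displays. This is a minimal sketch; the variable names are illustrative.

```python
from collections import Counter
from statistics import mean, median, mode

# Quarts of milk purchased by the 25 households.
quarts = [0] * 2 + [1] * 5 + [2] * 9 + [3] * 5 + [4] * 3 + [5]

milk_mean = mean(quarts)      # 55 / 25
milk_median = median(quarts)  # 13th ordered value
milk_mode = mode(quarts)      # most frequent value

# Relative frequency of each value (the bar heights in the histogram).
counts = Counter(quarts)
relative_frequency = {value: counts[value] / len(quarts) for value in sorted(counts)}

print(milk_mean, milk_median, milk_mode)  # 2.2 2 2
print(relative_frequency)
```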
Extreme Values
• The mean is more easily affected by extremely large or small values than the median.
• The median is often used as a measure of center when the distribution is skewed.
Extreme Values

Symmetric: Mean = Median

Skewed right: Mean > Median

Skewed left: Mean < Median


Measures of Variability
• A measure along the horizontal axis of the data distribution
that describes the spread of the distribution from the center.

• Range: difference between maximum and minimum values
• Interquartile range: difference between third and first quartile (Q3 - Q1)
• Variance: average of the squared deviations from the mean
• Standard deviation: square root of the variance


Variability

[Figure: two panels labeled "Variability" and "No Variability"]
The Range
• The range, R, of a set of n measurements is the
difference between the largest and smallest
measurements.
• Example: A botanist records the number of
petals on 5 flowers:
5, 12, 6, 8, 14
• The range is R = 14 – 5 = 9.
Quartiles

[Figure: ordered data divided by Q1, Q2, and Q3 into four parts, each containing 25% of the measurements]

Percentile
50th Percentile ≡ Median (Q2)
25th Percentile ≡ Lower Quartile (Q1)
75th Percentile ≡ Upper Quartile (Q3)

Interquartile Range:
IQR = Q3 – Q1
• The position of the p-th percentile is (p/100)(n + 1)
• The position of Q1 is 0.25(n + 1)
• The position of Q3 is 0.75(n + 1)

once the measurements have been ordered.

If the positions are not integers, find the quartiles by interpolation.
Example
The prices ($) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95

Position of Q1 = 0.25(18 + 1) = 4.75
Position of Q3 = 0.75(18 + 1) = 14.25

Q1 is 3/4 of the way between the 4th and 5th ordered measurements, or Q1 = 65 + 0.75(65 - 65) = 65.
Q3 is 1/4 of the way between the 14th and 15th ordered measurements, or Q3 = 74 + 0.25(75 - 74) = 74.25,
and
IQR = Q3 – Q1 = 74.25 - 65 = 9.25
90th percentile P90
• The position of the 90th percentile is 0.9(18 + 1) = 17.1

The prices ($) of 18 brands of walking shoes:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95

P90 = 90 + 0.10(95 - 90) = 90.5
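The position-and-interpolation rule used for Q1, Q3, and P90 can be sketched as a single function. `percentile_by_position` is a hypothetical helper written for this illustration; library routines such as `statistics.quantiles` or NumPy's percentile functions may use different interpolation conventions and return slightly different values.

```python
import math

def percentile_by_position(values, p):
    """p-th percentile at 1-based position (p/100) * (n + 1), with linear interpolation."""
    if not values:
        raise ValueError("percentile of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    # Clamp the position to the valid range 1..n (an assumption for extreme p).
    position = min(max((p / 100) * (n + 1), 1), n)
    lower = math.floor(position)
    fraction = position - lower
    if lower >= n:
        return ordered[-1]
    # Move `fraction` of the way from the lower to the upper neighbor.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

prices = [40, 60, 65, 65, 65, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]

q1 = percentile_by_position(prices, 25)
q3 = percentile_by_position(prices, 75)
p90 = percentile_by_position(prices, 90)
iqr = q3 - q1

print(q1, q3, iqr)    # 65.0 74.25 9.25
print(round(p90, 2))  # 90.5
```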


The Variance
• The variance is a measure of variability that uses all the measurements. It measures the average squared deviation of the measurements about their mean.
• Flower petals: 5, 12, 6, 8, 14

$$\bar{x} = \frac{45}{5} = 9$$

[Figure: the petal counts plotted on an axis from 4 to 14]
The Variance
• The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean µ.

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

• The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n – 1).

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
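The difference between the two divisors can be seen by applying both formulas to the flower-petal data. The sketch below uses the standard library's `pvariance` (divides by N) and `variance` (divides by n - 1); treating the same five values once as a population and once as a sample is only for illustration.

```python
from statistics import pvariance, variance

petals = [5, 12, 6, 8, 14]

# Treating the data as a whole population: divide by N.
population_variance = pvariance(petals)  # 60 / 5

# Treating the data as a sample: divide by n - 1.
sample_variance = variance(petals)       # 60 / 4

print(population_variance, sample_variance)
```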
The Standard Deviation
• In calculating the variance, we squared all of
the deviations, and in doing so changed the
scale of the measurements.
• To return this measure of variability to the
original units of measure, we calculate the
standard deviation, the positive square root of
the variance.

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$
Two Ways to Calculate the Sample Variance
Use the definition formula:

 xi     xi − x̄    (xi − x̄)²
  5      -4         16
 12       3          9
  6      -3          9
  8      -1          1
 14       5         25
Sum 45    0         60

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15$$

$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
Two Ways to Calculate the Sample Variance

Use the calculation formula:

 xi     xi²
  5      25
 12     144
  6      36
  8      64
 14     196
Sum 45  465

$$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{465 - \frac{45^2}{5}}{4} = 15$$

$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
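Both routes to the sample variance can be compared side by side in code. This is a minimal sketch with illustrative variable names, using the flower-petal data from the tables.

```python
import math

petals = [5, 12, 6, 8, 14]
n = len(petals)
x_bar = sum(petals) / n  # 45 / 5 = 9

# Definition formula: squared deviations about the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in petals) / (n - 1)

# Calculation formula: uses only the sum and the sum of squares.
sum_x = sum(petals)                    # 45
sum_x2 = sum(x * x for x in petals)    # 465
s2_calculation = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)
print(s2_definition, s2_calculation, round(s, 2))  # 15.0 15.0 3.87
```

The calculation formula avoids computing each deviation, which is convenient by hand, but on large values with small spread it can lose precision in floating-point arithmetic.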
Example: ungrouped data
• Sample: Moisture contents (%) of kraft paper are: 6.7, 6.0, 6.4, 6.4, 5.9, and 5.8.

$$s = \sqrt{\frac{231.26 - \frac{(37.2)^2}{6}}{6 - 1}} = 0.35$$

• Sample standard deviation, s = 0.35
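The intermediate sums in the moisture example (37.2 and 231.26) and the final result can be reproduced as follows. This is a sketch with illustrative variable names.

```python
import math

# Moisture content (%) of the six kraft paper samples.
moisture = [6.7, 6.0, 6.4, 6.4, 5.9, 5.8]
n = len(moisture)

total = sum(moisture)                          # about 37.2
total_of_squares = sum(x * x for x in moisture)  # about 231.26

# Calculation formula for the sample standard deviation.
s_moisture = math.sqrt((total_of_squares - total ** 2 / n) / (n - 1))

print(round(s_moisture, 2))  # 0.35
```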
Using Measures of Center and Spread:
The Empirical Rule
Given a distribution of measurements
that is approximately mound-shaped:
The interval µ ± σ contains approximately 68% of the
measurements.
The interval µ ± 2σ contains approximately 95% of
the measurements.
The interval µ ± 3σ contains approximately 99.7% of
the measurements.
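For an exactly normal distribution, the three percentages in the Empirical Rule can be computed from the normal cumulative distribution function. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; real mound-shaped data will only match these figures approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Probability that a normal measurement lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```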
The Empirical Rule: An Example
Measures of Relative Standing
• Where does one particular measurement stand in
relation to the other measurements in the data
set?
• How many standard deviations away from the
mean does the measurement lie? This is measured
by the z-score.

$$z\text{-score} = \frac{x - \bar{x}}{s}$$

Suppose s = 2, $\bar{x} = 5$, and x = 9. Then z = (9 - 5)/2 = 2, so x = 9 lies z = 2 standard deviations from the mean.
z-Scores
• z-scores between –2 and 2 are not unusual. z-scores
should not be more than 3 in absolute value. z-scores
larger than 3 in absolute value would indicate a
possible outlier.

[Figure: z-score axis from -3 to 3. Between -2 and 2: not unusual. Between 2 and 3 in absolute value: somewhat unusual. Beyond 3 in absolute value: outlier.]
Example of z-Scores
  X    z-Score        X    z-Score
 10   -1.28244       10   -0.29204
 15    0.625954     500    3.473714
 10   -1.28244       10   -0.29204
 16    1.007634      16   -0.24593
 11   -0.90076       11   -0.28435
 17    1.389313      17   -0.23824
 14    0.244275      14   -0.2613
 13   -0.1374        13   -0.26898
 10   -1.28244       10   -0.29204
 16    1.007634      16   -0.24593
 11   -0.90076       11   -0.28435
 17    1.389313      17   -0.23824
 14    0.244275      14   -0.2613
 13   -0.1374        13   -0.26898
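The two columns of the table can be recomputed and screened for possible outliers with the |z| > 3 rule. This is a sketch: `z_scores` and `possible_outliers` are hypothetical helper names, and the sample standard deviation (divisor n - 1) is assumed. Values computed this way may differ from the table in the third decimal place if the table was built from rounded intermediate values.

```python
from statistics import mean, stdev

def z_scores(values):
    """z = (x - mean) / s for each measurement, with s the sample standard deviation."""
    center = mean(values)
    spread = stdev(values)
    return [(x - center) / spread for x in values]

def possible_outliers(values, threshold=3):
    """Measurements whose z-score exceeds the threshold in absolute value."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

first = [10, 15, 10, 16, 11, 17, 14, 13, 10, 16, 11, 17, 14, 13]
second = [10, 500, 10, 16, 11, 17, 14, 13, 10, 16, 11, 17, 14, 13]

print(possible_outliers(first))   # []
print(possible_outliers(second))  # [500]
```

Note how the single value 500 pulls the mean up to 48 and inflates the standard deviation, so every other measurement in the second set ends up with a z-score close to zero.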
Coefficient of Variation(CV)
• When comparing data sets with different units or widely different means, one should use the coefficient of variation for comparison instead of the standard deviation.
• The coefficient of variation can be written as

$$CV = \frac{s}{\bar{x}}$$

• We express CV as a percentage by multiplying by 100.
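As an illustration, the coefficient of variation can be computed for two data sets that appeared earlier, the flower petals and the kraft paper moisture contents. This is a sketch; `coefficient_of_variation` is a hypothetical helper and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV as a percentage: 100 * s / mean, with s the sample standard deviation."""
    return 100 * stdev(values) / mean(values)

petals = [5, 12, 6, 8, 14]                   # counts
moisture = [6.7, 6.0, 6.4, 6.4, 5.9, 5.8]    # percent moisture

cv_petals = coefficient_of_variation(petals)
cv_moisture = coefficient_of_variation(moisture)

print(round(cv_petals, 2), round(cv_moisture, 2))  # 43.03 5.68
```

Because CV is unitless, the two values can be compared directly even though the data sets are measured in different units: the petal counts vary far more relative to their mean than the moisture contents do.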
Skewness
• Skewness measures the degree of asymmetry exhibited by the data
• The data can exhibit positive skewness or negative skewness
• If the mean of the data is greater than its median, the data is positively skewed; and if the mean of the data is less than its median, the data is negatively skewed
• Mathematically,

$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
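The skewness formula can be applied to the flower-petal data from earlier. This is a sketch that assumes s is the sample standard deviation (divisor n - 1) as defined in the variance slides; statistical packages often use slightly different skewness definitions and may give different values.

```python
from statistics import mean, stdev

def skewness(values):
    """Sum of cubed deviations about the mean, divided by n * s**3."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # assumption: sample standard deviation
    return sum((x - x_bar) ** 3 for x in values) / (n * s ** 3)

petals = [5, 12, 6, 8, 14]
petal_skewness = skewness(petals)

# Cubed deviations: -64, 27, -27, -1, 125, which sum to 60.
print(round(petal_skewness, 4))
```

The result is positive, consistent with the mean (9) being greater than the median (8).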
Skewness

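[Figure: three distribution shapes. Negatively skewed: mean, median, mode in increasing order. Symmetric (not skewed): mean, median, and mode coincide. Positively skewed: mode, median, mean in increasing order.]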
Kurtosis
• Kurtosis measures the peakedness of the data relative to the normal distribution
• Data with a high degree of peakedness is said to be leptokurtic and has a kurtosis value of more than 3
• Flat data has a kurtosis value of less than 3, and it is called platykurtic
• Mathematically,

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4}$$
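The kurtosis formula can be evaluated in the same way on the flower-petal data. This is a sketch that again assumes s is the sample standard deviation; many packages report "excess" kurtosis (this value minus 3) or use other corrections, so results may not match.

```python
from statistics import mean, stdev

def kurtosis(values):
    """Sum of fourth-power deviations about the mean, divided by n * s**4."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # assumption: sample standard deviation
    return sum((x - x_bar) ** 4 for x in values) / (n * s ** 4)

petals = [5, 12, 6, 8, 14]
petal_kurtosis = kurtosis(petals)

# Fourth-power deviations: 256, 81, 81, 1, 625, which sum to 1044; s**4 = 225.
print(round(petal_kurtosis, 3))  # 0.928
```

A value below 3 would classify this small data set as platykurtic under the rule stated above, although with only five measurements the estimate is very rough.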
Kurtosis
• Peakedness of a distribution
• Leptokurtic: high and thin
• Mesokurtic: normal in shape
• Platykurtic: flat and spread out

[Figure: three curves labeled Leptokurtic, Mesokurtic, and Platykurtic]
Skewness and Kurtosis
