
Descriptive Statistics: Numerical Methods

Descriptive Statistics
3.1 Describing Central Tendency
3.2 Measures of Variation
3.3 Percentiles, Quartiles
3.5 Grouped Data

Describing Central Tendency


In addition to describing the shape of a distribution, we want to describe the data set's central tendency
A measure of central tendency represents the center or middle of the data

Parameters and Statistics


A population parameter is a number calculated from all the population measurements that describes some aspect of the population
A sample statistic is a number calculated using the sample measurements that describes some aspect of the sample

Measures of Central Tendency


Mean, μ: The average or expected value
Median, Md: The value of the middle point of the ordered measurements
Mode, Mo: The most frequent value

The Mean
Population: X1, X2, ..., XN
Sample: x1, x2, ..., xn

Population Mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The Sample Mean


For a sample of size n, the sample mean is defined as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
and is a point estimate of the population mean μ


It is the value to expect, on average and in the long run

Example 3.1: The Car Mileage Case


Example 3.1: Sample mean for first five car mileages from Table 3.1: 30.8, 31.7, 30.1, 31.6, 32.1
$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{30.8 + 31.7 + 30.1 + 31.6 + 32.1}{5} = \frac{156.3}{5} = 31.26$$
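The arithmetic in Example 3.1 can be checked with a few lines of Python. This is only a minimal sketch: the function name `sample_mean` is chosen here for illustration and does not come from the slides.

```python
def sample_mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("sample_mean needs at least one measurement")
    return sum(values) / len(values)


# First five car mileages from Example 3.1
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
x_bar = sample_mean(mileages)
print(f"sample mean = {x_bar:.2f}")
```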

The Median
The median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it
1. If the number of measurements is odd, the median is the middlemost measurement in the ordering
2. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordering

Example: Car Mileage Case


Example 3.1: First five observations from Table 3.1: 30.8, 31.7, 30.1, 31.6, 32.1
In order: 30.1, 30.8, 31.6, 31.7, 32.1
There is an odd number of observations, so the median is the middle one, 31.6
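The odd/even rule for the median can be sketched as a short Python function. The name `median` and the four-value list used for the even case are illustrative assumptions; only the five mileages come from the example.

```python
def median(values):
    """Median by the odd/even rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median needs at least one measurement")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middlemost measurement
        return ordered[mid]
    # Even count: average of the two middlemost measurements
    return (ordered[mid - 1] + ordered[mid]) / 2


mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
md_odd = median(mileages)        # five values, odd case
md_even = median(mileages[:4])   # illustrative even case using the first four values
print(md_odd, md_even)
```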

The Mode
The mode Mo of a population or sample of measurements is the measurement that occurs most frequently
Modes are the values that are observed most often
Sometimes the highest frequency occurs at two or more values
If there are two modes, the data are bimodal
If there are more than two modes, the data are multimodal

When data are in classes, the class with the highest frequency is the modal class
The tallest box in the histogram
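A possible way to find every mode, so that bimodal and multimodal data are detected, is to count occurrences and keep all values tied for the highest count. The function name `modes` and the sample lists are hypothetical and chosen only for illustration.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of all values that share the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # If every value occurs once, all of them are returned
    return sorted(v for v, c in counts.items() if c == top)


unimodal = modes([3, 4, 4, 5])       # one value is most frequent
bimodal = modes([1, 2, 2, 3, 3])     # two values tie for most frequent
print(unimodal, bimodal)
```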

Histogram Describing the 50 Mileages

Relationships Among Mean, Median and Mode

Measures of Variation
Knowing the measures of central tendency is not enough
Both of the distributions below have identical measures of central tendency

Measures of Variation
Range: Largest minus the smallest measurement
Variance: The average of the squared deviations of all the population measurements from the population mean
Standard Deviation: The square root of the variance

The Range
Largest minus smallest
Measures the interval spanned by all the data
For Figure 3.13, largest repair time is 5 and smallest is 3
Range is 5 - 3 = 2 days

Population Variance and Standard Deviation


The population variance (σ²) is the average of the squared deviations of the individual population measurements from the population mean (μ)
The population standard deviation (σ) is the positive square root of the population variance

Variance

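For a population of size N, the population variance σ² is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_N - \mu)^2}{N}$$

For a sample of size n, the sample variance s² is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$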

Standard Deviation
Population standard deviation (σ):
$$\sigma = \sqrt{\sigma^2}$$

Sample standard deviation (s):
$$s = \sqrt{s^2}$$

Example: Chris's Class Sizes This Semester


Data points are: 60, 41, 15, 30, 34
Mean is 36
Variance is:
$$\sigma^2 = \frac{(60 - 36)^2 + (41 - 36)^2 + (15 - 36)^2 + (30 - 36)^2 + (34 - 36)^2}{5} = \frac{576 + 25 + 441 + 36 + 4}{5} = \frac{1082}{5} = 216.4$$

Standard deviation is:
$$\sigma = \sqrt{216.4} = 14.71$$
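The class-size calculation treats the five classes as the whole population, so the divisor is N. A minimal sketch of that calculation follows; the function name `population_variance` is an assumption made here.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    if n == 0:
        raise ValueError("population_variance needs at least one measurement")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


class_sizes = [60, 41, 15, 30, 34]
sigma_sq = population_variance(class_sizes)
sigma = math.sqrt(sigma_sq)
print(f"variance = {sigma_sq:.1f}, standard deviation = {sigma:.2f}")
```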

Example: Sample Variance and Standard Deviation


Example 3.7: data for first five car mileages from Table 3.1 are 30.8, 31.7, 30.1, 31.6, 32.1
The sample mean is 31.26
$$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1} = \frac{(30.8 - 31.26)^2 + (31.7 - 31.26)^2 + (30.1 - 31.26)^2 + (31.6 - 31.26)^2 + (32.1 - 31.26)^2}{4} = \frac{2.572}{4} = 0.643$$

$$s = \sqrt{s^2} = \sqrt{0.643} = 0.8019$$
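For sample data the divisor is n - 1 rather than n. The sketch below recomputes Example 3.7 by hand and, as a cross-check, with `statistics.variance` from the Python standard library, which also divides by n - 1. The name `sample_variance` is illustrative only.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample_variance needs at least two measurements")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
s_sq = sample_variance(mileages)
s = math.sqrt(s_sq)
library_s_sq = statistics.variance(mileages)  # standard-library sample variance
print(f"s^2 = {s_sq:.3f}, s = {s:.4f}, library s^2 = {library_s_sq:.3f}")
```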

The Empirical Rule for Normal Populations


If a population has mean μ and standard deviation σ and is described by a normal curve, then
68.26% of the population measurements lie within one standard deviation of the mean: [μ-σ, μ+σ]
95.44% of the population measurements lie within two standard deviations of the mean: [μ-2σ, μ+2σ]
99.73% of the population measurements lie within three standard deviations of the mean: [μ-3σ, μ+3σ]

The Empirical Rule and Tolerance Intervals

Example 3.9: The Car Mileage Case


Continued

68.26% of all individual cars will have mileages in the range [x̄ ± s] = [31.6 ± 0.8] = [30.8, 32.4] mpg
95.44% of all individual cars will have mileages in the range [x̄ ± 2s] = [31.6 ± 1.6] = [30.0, 33.2] mpg
99.73% of all individual cars will have mileages in the range [x̄ ± 3s] = [31.6 ± 2.4] = [29.2, 34.0] mpg
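The three tolerance intervals can be generated from the mean and standard deviation in one step. The following is a sketch under the assumption that the normal-curve percentages from the empirical rule apply; the function name `tolerance_intervals` is hypothetical.

```python
def tolerance_intervals(mean, sd):
    """Map k = 1, 2, 3 to (lower, upper, percent covered) under the empirical rule."""
    coverage = {1: 68.26, 2: 95.44, 3: 99.73}
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}


# Values used in Example 3.9: mean 31.6 mpg, standard deviation 0.8 mpg
intervals = tolerance_intervals(31.6, 0.8)
for k, (low, high, pct) in intervals.items():
    print(f"{pct}% within {k} sd: [{low:.1f}, {high:.1f}] mpg")
```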

Chebyshev's Theorem

Let μ and σ be a population's mean and standard deviation, then for any value k > 1
At least 100(1 - 1/k²)% of the population measurements lie in the interval [μ - kσ, μ + kσ]
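To get a feel for the bound, it can be evaluated for a few values of k. A minimal sketch, with `chebyshev_lower_bound` as an illustrative name, returning the bound as a proportion rather than a percentage:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.2f}%")
```

For k = 2 the guarantee is at least 75%, which is weaker than the 95.44% given by the empirical rule, but it holds for any population shape.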

Coefficient of Variation
Measures the size of the standard deviation relative to the size of the mean
Coefficient of variation = (standard deviation / mean) × 100%
Used to:
Compare the relative variabilities of values about the mean
Compare the relative variability of populations or samples with different means and different standard deviations
Measure risk
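As an illustration, the coefficient of variation can be computed for two data sets that appear elsewhere in these slides: the car mileages (mean 31.6, standard deviation 0.8) and the class sizes (mean 36, standard deviation 14.71). Pairing them here is an assumption made only for the sake of the example, and the function name is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return sd / mean * 100


cv_mileage = coefficient_of_variation(0.8, 31.6)
cv_classes = coefficient_of_variation(14.71, 36)
print(f"mileage CV = {cv_mileage:.2f}%, class size CV = {cv_classes:.2f}%")
```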

Percentiles, Quartiles, and Box-and-Whiskers Displays


For a set of measurements arranged in increasing order, the pth percentile is a value such that p percent of the measurements fall at or below the value and (100-p) percent of the measurements fall at or above the value
The first quartile Q1 is the 25th percentile
The second quartile (or median) is the 50th percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR is Q3 - Q1

Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile

Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4

where n is the number of observed values

Quartiles
Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd values: Q1 = 12.5

Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency

Quartiles
(continued)

Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = 12.5

Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16

Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = 19.5
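The position rule can be sketched in Python as follows. The slides only show positions ending in .5, where the value halfway between two neighbours is used; treating any fractional position by linear interpolation is an assumption made here, and the function name `value_at_position` is hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    if not 1 <= position <= len(ordered):
        raise ValueError("position must lie between 1 and n")
    lower = int(position)            # whole part of the 1-based position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    # Assumption: interpolate linearly between the two neighbouring values
    return below + fraction * (above - below)


data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])
n = len(data)
q1 = value_at_position(data, (n + 1) / 4)       # position 2.5
q2 = value_at_position(data, (n + 1) / 2)       # position 5
q3 = value_at_position(data, 3 * (n + 1) / 4)   # position 7.5
print(q1, q2, q3, "IQR =", q3 - q1)
```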

Five Number Summary


1. The smallest measurement
2. The first quartile, Q1
3. The median, Md
4. The third quartile, Q3
5. The largest measurement

Displayed visually using a box-and-whiskers plot

Five Number Summary


Example:
minimum   Q1   Median (Q2)   Q3   maximum
12        30   45            57   70
25% of the measurements fall in each of the four segments between consecutive values

Interquartile range = 57 - 30 = 27

Weighted Means
Sometimes, some measurements are more important than others
Assign numerical weights to the data
Weights measure relative importance of the value

Calculate weighted mean as
$$\frac{\sum w_i x_i}{\sum w_i}$$
where wi is the weight assigned to the ith measurement xi

Example: Weighted Mean


June 2001 unemployment rates by census region
Northeast, 26.9 million in civilian labor force, 4.1% unemployment rate
South, 50.6 million, 4.7% unemployment
Midwest, 34.7 million, 4.4% unemployment
West, 32.5 million, 5.0% unemployment

Want the mean unemployment rate for the US

Example: Weighted Mean

Continued

Want the mean unemployment rate for the U.S.
Calculate it as a weighted mean
So that the bigger the region, the more heavily it counts in the mean

The data values are the regional unemployment rates
The weights are the sizes of the regional labor forces

Example: Weighted Mean


Continued

$$\frac{(26.9 \times 4.1) + (50.6 \times 4.7) + (34.7 \times 4.4) + (32.5 \times 5.0)}{26.9 + 50.6 + 34.7 + 32.5} = \frac{663.29}{144.7} = 4.58\%$$

Note that the unweighted mean is 4.55%, which underestimates the true rate by 0.03%
That is, 0.0003 × 144.7 million = 43,410 workers
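The weighted and unweighted means for the four census regions can be compared with a short sketch. The function name `weighted_mean` is illustrative; the rates and labor-force sizes are the ones given in the example.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Northeast, South, Midwest, West
rates = [4.1, 4.7, 4.4, 5.0]             # unemployment rate, percent
labor_force = [26.9, 50.6, 34.7, 32.5]   # civilian labor force, millions

weighted = weighted_mean(rates, labor_force)
unweighted = sum(rates) / len(rates)
print(f"weighted = {weighted:.2f}%, unweighted = {unweighted:.2f}%")
```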

Descriptive Statistics for Grouped Data

Data already categorized into a frequency distribution or a histogram is called grouped data
Can calculate the mean and variance even when the raw data is not available
Calculations are slightly different for data from a sample and data from a population

Mean for Grouped Data Example


Find the arithmetic mean for the following continuous frequency distribution:
Class   Frequency
0-1     1
1-2     4
2-3     8
3-4     7
4-5     3
5-6     2

Solution for the Example


Class    X     f     fX
0-1      0.5   1     0.5
1-2      1.5   4     6.0
2-3      2.5   8     20.0
3-4      3.5   7     24.5
4-5      4.5   3     13.5
5-6      5.5   2     11.0
Totals         25    75.5
Mean                 3.02

Applying the formula
$$\bar{X} = \frac{\sum fX}{n} = \frac{75.5}{25} = 3.02$$
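The grouped-mean calculation can be reproduced in Python by taking each class midpoint as the representative value X. Representing a class as a `(lower, upper, frequency)` tuple and the name `grouped_mean` are assumptions made for this sketch.

```python
def grouped_mean(classes):
    """Mean of grouped data: sum of f * midpoint divided by the total frequency."""
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("grouped_mean needs at least one observation")
    # Midpoint of each class stands in for every observation in that class
    return sum(f * (lower + upper) / 2 for lower, upper, f in classes) / n


distribution = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]
mean = grouped_mean(distribution)
print(f"grouped mean = {mean:.2f}")
```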

Median for Grouped Data


Formula for Median is given by
$$\text{Median} = L + \frac{(n/2) - m}{f} \times c$$
Where
L = Lower limit of the median class
n = Total number of observations = Σf
m = Cumulative frequency preceding the median class
f = Frequency of the median class
c = Class interval of the median class

Median for Grouped Data Example


Find the median for the following continuous frequency distribution:
Class   Frequency
0-1     1
1-2     4
2-3     8
3-4     7
4-5     3
5-6     2

Solution for the Example


Class   Frequency   Cumulative Frequency
0-1     1           1
1-2     4           5
2-3     8           13
3-4     7           20
4-5     3           23
5-6     2           25
Total   25

Substituting the relevant values in the formula
$$\text{Median} = L + \frac{(n/2) - m}{f} \times c$$
we have
$$\text{Median} = 2 + \frac{(25/2) - 5}{8} \times 1 = 2.9375$$
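The steps of the grouped-median formula can be sketched in Python. The median class is taken here to be the first class whose cumulative frequency reaches n/2, which matches the worked solution; the tuple layout `(lower, upper, frequency)` and the function name are illustrative assumptions.

```python
def grouped_median(classes):
    """Median = L + ((n/2 - m) / f) * c for classes listed in increasing order."""
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("grouped_median needs at least one observation")
    m = 0  # cumulative frequency preceding the current class
    for lower, upper, f in classes:
        if f > 0 and m + f >= n / 2:
            c = upper - lower  # class interval of the median class
            return lower + (n / 2 - m) / f * c
        m += f
    raise ValueError("median class not found")


distribution = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]
md = grouped_median(distribution)
print(f"grouped median = {md}")
```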

Mode for Grouped Data


$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times c$$
Where
L = Lower limit of the modal class
d1 = f1 - f0
d2 = f1 - f2
f1 = Frequency of the modal class
f0 = Frequency preceding the modal class
f2 = Frequency succeeding the modal class
c = Class interval of the modal class

Mode for Grouped Data Example


Example: Find the mode for the following continuous frequency distribution:
Class   Frequency
0-1     1
1-2     4
2-3     8
3-4     7
4-5     3
5-6     2

Solution for the Example


Class   Frequency
0-1     1
1-2     4
2-3     8
3-4     7
4-5     3
5-6     2
Total   25

$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times c$$
L = 2
d1 = f1 - f0 = 8 - 4 = 4
d2 = f1 - f2 = 8 - 7 = 1
c = 1
Hence
$$\text{Mode} = 2 + \frac{4}{5} \times 1 = 2.8$$
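A possible Python version of the grouped-mode formula is shown below. Two choices are assumptions not covered by the slides: when the modal class is the first or last class, the missing neighbouring frequency is treated as 0, and when several classes tie for the highest frequency, the first one is used.

```python
def grouped_mode(classes):
    """Mode = L + d1 / (d1 + d2) * c for classes given as (lower, upper, frequency)."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                      # first class with the highest frequency
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0                # assumption: no preceding class counts as 0
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0   # assumption: no succeeding class counts as 0
    d1, d2 = f1 - f0, f1 - f2
    if d1 + d2 == 0:
        raise ValueError("formula is undefined when d1 + d2 = 0")
    return lower + d1 / (d1 + d2) * (upper - lower)


distribution = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]
mo = grouped_mode(distribution)
print(f"grouped mode = {mo:.1f}")
```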

Standard Deviation for Grouped Data


The standard deviation for sample data, based on a frequency distribution, is given by
$$S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}}$$
which is used to estimate the population standard deviation σ. Here
$$\bar{X} = \frac{\sum fX}{n}$$
n is the sample size = Σf, X = midpoint of each class

Standard Deviation for Grouped Data-Example


Frequency Distribution of Return on Investment of Mutual Funds
Return on Investment   Number of Mutual Funds
5-10                   10
10-15                  12
15-20                  16
20-25                  14
25-30                  8
Total                  60

Solution for the Example

Solution for the Example


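From the spreadsheet of Microsoft Excel in the previous slide, it is easy to see
$$\text{Mean} = \bar{X} = \frac{\sum fX}{n} = \frac{1040}{60} = 17.333 \text{ (cell F10)}$$
$$\text{Standard Deviation} = S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{2448.33}{59}} = 6.44 \text{ (cell H12)}$$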
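Since the spreadsheet itself is not reproduced here, the figures quoted from it can be checked with a short Python sketch. The tuple layout `(lower, upper, frequency)` and the function name `grouped_mean_and_sd` are assumptions made for illustration.

```python
import math


def grouped_mean_and_sd(classes):
    """Return (mean, sum of f * squared deviations, sample standard deviation)."""
    n = sum(f for _, _, f in classes)
    if n < 2:
        raise ValueError("need at least two observations")
    midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
    freqs = [f for _, _, f in classes]
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
    return mean, sum_sq, math.sqrt(sum_sq / (n - 1))


# Return on investment classes and number of mutual funds in each
funds = [(5, 10, 10), (10, 15, 12), (15, 20, 16), (20, 25, 14), (25, 30, 8)]
mean, sum_sq, sd = grouped_mean_and_sd(funds)
print(f"mean = {mean:.3f}, sum of squares = {sum_sq:.2f}, S = {sd:.2f}")
```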

Practice Problem
Q. 3.47 pp. 154, calculate sample mean, variance and s.d.
Age (Years)   Frequency
28-32         1
33-37         3
38-42         3
43-47         13
48-52         14
53-57         12
58-62         9
63-67         1
68-72         3
73-77         1

Solution
Midpoint   Freq   M*f   (M - mean)**2   F*diff^2
30         1      30    462.25          462.25
35         3      105   272.25          816.75
40         3      120   132.25          396.75
45         13     585   42.25           549.25
50         14     700   2.25            31.5
55         12     660   12.25           147
60         9      540   72.25           650.25
65         1      65    182.25          182.25
70         3      210   342.25          1026.75
75         1      75    552.25          552.25

sample mean = 51.5
variance = 81.61017
s = 9.033835
