Descriptive Statistics

Quantitative (variable)
  Discrete (no. of customers, no. of claims)
  Continuous (salary, price)
Qualitative (attribute)
  Ordinal (customer satisfaction, efficiency of workers, bond rating)
  Nominal (sex, nationality, eye color)
7/3/2013 2 Descriptive Statistics
Data
  Primary
  Secondary
Data
  Time series (unemployment rate, GDP)
  Cross-sectional (queue length in different SBI branches)
Definition
Primary Data
  Collected directly from the source
  Collected under the control and supervision of the investigator
Secondary Data
  Not collected by the investigator
  Derived from other sources
Methods of Collecting Primary Data
  Interview Method
  Questionnaire Method
  Observation Method

Diagram Presentation
Diagram
  Line (time series): Simple, Multiple
  Bar: Vertical (time series), Horizontal (cross-sectional), Component (Subdivided)
  Pie
When data are collected in original form,
they are called raw data.
When the raw data is organized into a
frequency distribution, the frequency will
be the number of values in a specific class
of the distribution (grouped data).
Data Table: Compressive Strength of 80 Aluminum-Lithium Alloy Specimens
105 221 183 186 121 181 180 143
97 154 153 174 120 168 167 141
245 228 174 199 181 158 176 110
163 131 154 115 160 208 158 133
207 180 190 193 194 133 156 123
134 178 76 167 184 135 229 146
218 157 101 171 165 172 158 169
199 151 142 163 145 171 148 158
160 175 149 87 160 237 150 135
196 201 200 176 150 170 118 149
Stem-And-Leaf
Stem | Leaf | Frequency
 7 | 6 | 1
 8 | 7 | 1
 9 | 7 | 1
10 | 5 1 | 2
11 | 5 0 8 | 3
12 | 1 0 3 | 3
13 | 4 1 3 5 3 5 | 6
14 | 2 9 5 8 3 1 6 9 | 8
15 | 4 7 1 3 4 0 8 8 6 8 0 8 | 12
16 | 3 0 7 3 0 5 0 8 7 9 | 10
17 | 8 5 4 4 1 6 2 1 0 6 | 10
18 | 0 3 6 1 4 1 0 | 7
19 | 9 6 0 9 3 4 | 6
20 | 7 1 0 8 | 4
21 | 8 | 1
22 | 1 8 9 | 3
23 | 7 | 1
24 | 5 | 1
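
A stem-and-leaf display like this one can be rebuilt from the raw data table. The following is a minimal sketch, assuming the stem is every digit except the last and the leaf is the last digit; the function name stem_and_leaf is illustrative only, and leaves are sorted rather than kept in table order.

```python
from collections import defaultdict

# Compressive strength values copied from the data table above (80 specimens).
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

def stem_and_leaf(values):
    """Map each stem (value // 10) to the sorted list of its leaves (value % 10)."""
    stems = defaultdict(list)
    for v in values:
        stems[v // 10].append(v % 10)
    return {stem: sorted(leaves) for stem, leaves in sorted(stems.items())}

display = stem_and_leaf(strength)
for stem, leaves in display.items():
    # One row per stem: stem | leaves | frequency
    print(f"{stem:>2} | {' '.join(map(str, leaves))} | {len(leaves)}")
```

The frequency column is simply the number of leaves on each stem, so the frequencies add up to the 80 observations.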

Terms Associated with a Grouped Frequency Distribution

Class Width
Class width = upper class boundary - lower class boundary

Class Mark or Mid-Value
Class marks are the midpoints of the class boundaries.
Class mark = 1/2 (upper class boundary + lower class boundary)

Frequency Density
FD = class frequency / class width
It gives the number of observations in a class of width one.
Use: when class widths are not equal, frequency density is plotted on the y-axis to draw the histogram.

Relative Frequency
RF = class frequency / total frequency
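
The grouped-distribution terms above can be computed class by class. This is a small sketch on a made-up distribution with unequal class widths; the boundaries and frequencies are hypothetical and are not taken from the slides.

```python
# Hypothetical grouped distribution: (lower boundary, upper boundary, class frequency).
classes = [(0, 10, 5), (10, 20, 15), (20, 40, 20), (40, 80, 10)]

total = sum(f for _, _, f in classes)  # total frequency

rows = []
for lower, upper, f in classes:
    width = upper - lower
    rows.append({
        "mark": (lower + upper) / 2,   # class mark (mid-value)
        "width": width,                # class width
        "density": f / width,          # frequency density
        "relative": f / total,         # relative frequency
    })

for row in rows:
    print(row)
```

Because the widths differ, the class 20-40 has the largest frequency (20) but a lower frequency density (1.0) than the class 10-20 (1.5), which is why density goes on the y-axis of the histogram.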
Visualizing Data
The three most commonly used
graphs in research are:
The histogram.
The frequency polygon.
The cumulative frequency graph or
ogive

Key Characteristics
Characteristic | Definition / Interpretation
Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion | How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Measures
Measure | Formula | Excel Formula | Pro | Con
Mean (raw data) | \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \) | =AVERAGE(Data) | Familiar and uses all the sample information. | Not appropriate for data with extreme values or open-ended classes.
Mean (grouped data) | \( \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} \) | =SUMPRODUCT(f, x)/SUM(f), where f and x are the frequency and value ranges | Familiar and uses all the sample information. | Not appropriate for data with extreme values or open-ended classes.
Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist. | Statistical procedures for the median are complex.
Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range. | May not be unique, and is not helpful for continuous data.
Population vs Sample Characteristics
A statistic is a descriptive measure derived from a sample (n items).
A parameter is a descriptive measure derived from a population (N items).
Calculation of Mean

Raw data:
Population mean: \( \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \); N = population size
Sample mean: \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \); n = sample size

Grouped data:
Population mean: \( \mu = \frac{1}{N}\sum_{i=1}^{k} f_i x_i \); \( N = \sum_{i=1}^{k} f_i \)
Sample mean: \( \bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i \); \( n = \sum_{i=1}^{k} f_i \)
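
The raw-data and grouped-data mean formulas can be sketched in a few lines. The function names and the numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def mean_raw(values):
    """Mean of raw data: sum of the observations divided by their count."""
    return sum(values) / len(values)

def mean_grouped(values, freqs):
    """Mean of grouped data: sum of f_i * x_i divided by the sum of f_i."""
    total = sum(freqs)
    return sum(f * x for f, x in zip(freqs, values)) / total

raw = [11, 12, 15, 17, 21, 32]  # hypothetical raw observations
xs = [1, 2, 3]                  # hypothetical values (or class marks)
fs = [2, 5, 3]                  # hypothetical frequencies

print(mean_raw(raw))            # 108 / 6
print(mean_grouped(xs, fs))     # (2*1 + 5*2 + 3*3) / 10
```

The grouped formula gives the same result as expanding each value f_i times and applying the raw formula.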

Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in a small
college town. The monthly rent prices for these apartments are
listed below.

Sample Mean
\( \bar{x} = \frac{\sum x_i}{n} = \frac{34,356}{70} = 490.80 \)

Calculation of Median (n is even)
For even n, Median = \( \frac{x_{n/2} + x_{(n/2)+1}}{2} \)
Consider the following n = 6 data values:
11 12 15 17 21 32
What is the median?
n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4
M = (x_3 + x_4)/2 = (15 + 17)/2 = 16
Calculation of Median (n is odd)
For odd n, Median = \( x_{(n+1)/2} \)
Consider the following n = 7 data values:
11 12 15 17 21 32 38
What is the median?
(n+1)/2 = 8/2 = 4, so M = x_4 = 17
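
The two median rules (even n and odd n) can be combined into one function. This is a minimal sketch using the two data sets from the slides; the function name median is illustrative, and Python's 0-based indexing shifts each position down by one.

```python
def median(values):
    """Median by position: x_((n+1)/2) for odd n, mean of x_(n/2) and x_(n/2+1) for even n."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # position (n+1)/2, converted to a 0-based index
    return (s[n // 2 - 1] + s[n // 2]) / 2  # positions n/2 and n/2 + 1

print(median([11, 12, 15, 17, 21, 32]))      # even n = 6 -> 16.0
print(median([11, 12, 15, 17, 21, 32, 38]))  # odd n = 7 -> 17
```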
Trimmed Mean
Another measure, sometimes used when extreme
values are present, is the trimmed mean.
It is obtained by deleting a percentage of the
smallest and largest values from a data set and then
computing the mean of the remaining values.
For example, the 5% trimmed mean is obtained by
removing the smallest 5% and the largest 5% of the
data values and then computing the mean of the
remaining values.
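
A trimmed mean can be sketched as follows. The data are hypothetical, and rounding the number of trimmed values down to a whole number is an assumption; the slides do not say how to handle a fractional count.

```python
def trimmed_mean(values, percent):
    """Drop the smallest and largest `percent`% of values, then average the rest."""
    s = sorted(values)
    k = int(len(s) * percent / 100)  # values dropped at EACH end (rounded down: an assumption)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

# Hypothetical sample of 20 values with one extreme observation.
data = list(range(1, 20)) + [1000]

print(sum(data) / len(data))   # ordinary mean, pulled up by the extreme value
print(trimmed_mean(data, 5))   # 5% trimmed mean: drops 1 and 1000
```

With 20 values, 5% is exactly one observation at each end, so the extreme value 1000 no longer affects the result.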
Mode
A bimodal distribution refers to the shape of the
histogram rather than the mode of the raw data.
It occurs when dissimilar populations are combined in
one sample.
Percentiles and Quartiles
Percentiles divide the sorted data into 100 groups and show how the
data spread over the interval from the smallest to the largest value.
For example, suppose you score in the 83rd percentile on a
standardized test. That means that 83% of the test-takers
scored below you.
Deciles divide the data into 10 groups.
Quartiles divide the data into 4 groups.
In general, the pth-order quantile or fractile (\( Z_p \)) is the value
below which a proportion p of the total observations lie.
Put p = 1/4, 2/4, 3/4 to get the quartiles.
Put p = 1/10, 2/10, ..., 9/10 to get the deciles.
Put p = 1/100, 2/100, ..., 99/100 to get the percentiles.

Step 1. Sort the observations.
Step 2. Calculate np, where n = number of observations.
Step 3. If np is not an integer, take the next integer as the position.
Otherwise, take both np and the next integer as the positions and use
the mean of the two values.
Third Quartile
Third quartile = 75th percentile
np = (75/100)(70) = 52.5, which is not an integer, so the position is 53.
Third quartile = 525 (the value in position 53 of the sorted data)
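
The three-step position rule can be sketched as follows. The function name quantile and its (k, parts) arguments are illustrative choices; p is passed as the fraction k/parts so that the integer test in Step 3 uses exact arithmetic. The 70 values are hypothetical stand-ins, not the apartment rents.

```python
def quantile(values, k, parts):
    """Quantile of order p = k/parts using the position rule (Steps 1 to 3)."""
    s = sorted(values)                 # Step 1: sort the observations
    q, r = divmod(len(s) * k, parts)   # Step 2: np, split into whole part q and remainder r
    if r != 0:
        return s[q]                    # Step 3: np not an integer -> position q + 1 (0-based index q)
    return (s[q - 1] + s[q]) / 2       # Step 3: np an integer -> mean of positions np and np + 1

# Hypothetical sample of n = 70 values (1, 2, ..., 70), so each value equals its position.
sample = list(range(1, 71))
print(quantile(sample, 3, 4))                   # np = 52.5 -> position 53
print(quantile([11, 12, 15, 17, 21, 32], 1, 2)) # np = 3 -> mean of positions 3 and 4
```

With p = 1/2 the rule reproduces the median calculation for even n.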
Dispersion
Describes how similar a set of observations
are to each other
or
the degree of deviation (spread) of a set of
data from their central value
In general, the more spread out a distribution is,
the larger the measure of dispersion will be
Measures of Dispersion
There are five main measures of dispersion:
Range
Mean Deviation
Mean squared deviation (variance)
Root mean squared deviation (Standard
Deviation)
Inter-quartile range (IQR)

Measures
Measure | Formula | Excel Formula | Pro | Con
Range | \( x_{\max} - x_{\min} \) | =MAX(Data)-MIN(Data) | Easy to calculate | Sensitive to extreme data values.
Mean Deviation | \( \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}| \) | =AVEDEV(Data) | Measures deviation accurately | Further algebraic treatment is not possible
Population Variance | \( \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \) | =VARP(array) | Important measure | Overestimates the error
Sample Variance | \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \) | =VAR(array) | Important measure | Overestimates the error
REMEMBER
For grouped data:
\( s^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i x_i^2 - \frac{n}{n-1}\bar{x}^2 \)
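
The two forms of the grouped-data sample variance can be checked against each other numerically. The values and frequencies below are hypothetical, and the function names are illustrative only.

```python
def grouped_variance_definition(xs, fs):
    """s^2 = sum f_i (x_i - xbar)^2 / (n - 1)."""
    n = sum(fs)
    xbar = sum(f * x for f, x in zip(fs, xs)) / n
    return sum(f * (x - xbar) ** 2 for f, x in zip(fs, xs)) / (n - 1)

def grouped_variance_shortcut(xs, fs):
    """s^2 = sum f_i x_i^2 / (n - 1) - n * xbar^2 / (n - 1)."""
    n = sum(fs)
    xbar = sum(f * x for f, x in zip(fs, xs)) / n
    return sum(f * x ** 2 for f, x in zip(fs, xs)) / (n - 1) - n * xbar ** 2 / (n - 1)

xs = [1, 2, 3]  # hypothetical values (or class marks)
fs = [2, 5, 3]  # hypothetical frequencies, n = 10

print(grouped_variance_definition(xs, fs))
print(grouped_variance_shortcut(xs, fs))
```

Both calls give about 0.544 (4.9 / 9), differing only by floating-point rounding.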
Measures
Measure | Formula | Excel Formula | Pro
Population Standard Deviation | \( \sigma = \sqrt{\sigma^2} \) | =STDEVP(array) | Best measure
Sample Standard Deviation | \( s = \sqrt{s^2} \) | =STDEV(array) | Best measure
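
Python's standard statistics module offers counterparts to the four Excel functions in these tables. The data set below is hypothetical; pvariance and pstdev divide by N, while variance and stdev divide by n - 1.

```python
import statistics

data = [11, 12, 15, 17, 21, 32]  # hypothetical observations, mean = 18

pop_var = statistics.pvariance(data)   # population variance (like VARP): divides by N
samp_var = statistics.variance(data)   # sample variance (like VAR): divides by n - 1
pop_sd = statistics.pstdev(data)       # population standard deviation (like STDEVP)
samp_sd = statistics.stdev(data)       # sample standard deviation (like STDEV)

print(pop_var, samp_var)
print(round(pop_sd, 3), round(samp_sd, 3))
```

The squared deviations sum to 300 here, so the population variance is 300/6 = 50 and the sample variance is 300/5 = 60.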
Inter-quartile Range
The inter-quartile range (IQR) is defined as the
difference between the third and first quartiles.
The first quartile is the 25th percentile.
The third quartile is the 75th percentile.
IQR = Q_3 - Q_1
The semi-interquartile range (SIR) is the IQR
divided by two: SIR = (Q_3 - Q_1)/2
When To Use the IQR or SIR
The IQR is the range for the middle 50% of the data.
The SIR is often used with skewed data, as it is
insensitive to extreme scores.
The SIR is used with open-end distributions.

Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
Coefficient of Variation (CV)
Relative measure (unit free) used for the purpose
of comparison of variability when
(i) two variables of different units are compared
(ii) two variables of same unit with varying mean
are compared
Relative Measure=absolute measure/avg. *100

\( CV = \frac{s}{\bar{x}} \times 100 \)

Sample Variance, Standard Deviation,
And Coefficient of Variation
Variance: \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = 2996.16 \)
Standard Deviation: \( s = \sqrt{s^2} = \sqrt{2996.16} = 54.74 \)
Coefficient of Variation: \( \left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{490.80} \times 100\right)\% = 11.15\% \)
The standard deviation is about 11% of the mean.
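
The apartment-rent figures quoted on this slide (sample variance 2996.16, sample mean 490.80) are enough to reproduce the standard deviation and the coefficient of variation. The helper name coefficient_of_variation is illustrative only.

```python
import math

def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (s / xbar) * 100."""
    return std_dev / mean * 100

variance = 2996.16   # sample variance from the slide
mean_rent = 490.80   # sample mean from the slide

std_dev = math.sqrt(variance)
cv = coefficient_of_variation(std_dev, mean_rent)

print(round(std_dev, 2))  # 54.74
print(round(cv, 2))       # 11.15
```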
Skewness
Skew is a measure of asymmetry in the
distribution of data.
Shapes: Positive Skew, Negative Skew, Normal (skew = 0)
Measure of Skew
Skewness is a unit-free measure of the shape of any
frequency distribution.
The coefficient compares two samples measured in
different units or one sample with a known reference
distribution (e.g., symmetric normal distribution).
Calculate the sample's skewness coefficient \( g_1 \).
Nature of Skewness
If \( g_1 > 0 \), the distribution has positive skewness, or
is right skewed.
If \( g_1 < 0 \), the distribution has negative skewness,
or is left skewed.
If \( g_1 = 0 \), the distribution is symmetrical.
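
The sign rule for the skewness coefficient can be illustrated numerically. The slides do not show which formula they use, so this sketch assumes the plain moment coefficient (third central moment divided by the second central moment raised to the power 3/2). Excel's SKEW applies a small-sample adjustment and returns somewhat different values, although the sign is the same. The data are hypothetical.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed definition, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tail = [11, 12, 15, 17, 21, 32]  # long right tail
left_tail = [-x for x in right_tail]   # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(moment_skewness(right_tail))  # positive: right skewed
print(moment_skewness(left_tail))   # negative: left skewed
print(moment_skewness(symmetric))   # zero: symmetrical
```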
Kurtosis
Kurtosis is the relative length of the tails and the
degree of concentration in the center.
Consider three kurtosis prototype shapes.
When the distribution is normal, its
kurtosis equals 3 and it is said to be mesokurtic.
When the distribution is less spread out than
normal, its kurtosis is greater than 3 and it is said
to be leptokurtic.
When the distribution is more spread out than
normal, its kurtosis is less than 3 and it is said to
be platykurtic.
z-Scores
The z-score is often called the standardized value.

It denotes the number of standard deviations a data
value \( x_i \) is from the mean.

An observation's z-score is a measure of the relative
location of the observation in a data set.

\( z_i = \frac{x_i - \bar{x}}{s} \)

Excel's STANDARDIZE function can be used to
compute the z-score.

Standardized Values for Apartment Rents
\( z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20 \)
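
The z-score calculation for the apartment-rent example (rent 425, sample mean 490.80, sample standard deviation 54.74) can be sketched as below. The function name z_score is illustrative only.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

z = z_score(425, 490.80, 54.74)
print(round(z, 2))  # -1.2: the rent lies about 1.2 standard deviations below the mean
```

A negative z-score means the observation is below the mean; a positive one means it is above.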


Chebyshev's Theorem
At least \( (1 - 1/z^2) \) of the items in any data set will be
within z standard deviations of the mean, where z is
any value greater than 1.
Chebyshev's theorem requires z > 1, but z need not
be an integer.
At least 75% of the data values must be within z = 2 standard deviations of the mean.
At least 89% of the data values must be within z = 3 standard deviations of the mean.
At least 94% of the data values must be within z = 4 standard deviations of the mean.
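
The percentages on this slide follow directly from the bound 1 - 1/z^2. This is a short sketch; the function name chebyshev_bound is illustrative only.

```python
def chebyshev_bound(z):
    """Minimum proportion of data within z standard deviations of the mean (requires z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    print(z, round(chebyshev_bound(z), 2))  # 0.75, 0.89, 0.94

print(chebyshev_bound(1.5))  # z need not be an integer
```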
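Empirical Rule
For data having a bell-shaped
distribution:
68.26% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.
95.44% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
99.72% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
[Figure: bell curve centered at the mean, with bands at 1, 2, and 3 standard deviations on either side covering 68.26%, 95.44%, and 99.72% of the values]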
Detecting Outliers
An outlier is an unusually small or unusually large
value in a data set.
A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
Box Plot
A box plot is a graphical summary of data that can be used to identify
outliers.
A key to the development of a box plot is the
computation of the median and the quartiles \( Q_1 \) and \( Q_3 \).
Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325
Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645
The lower limit is located 1.5(IQR) below Q1
The upper limit is located 1.5(IQR) above Q3.
There are no outliers (values less than 325 or
greater than 645) in the apartment rent data.
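
The box plot limits can be computed from the two quartiles given on the slide (Q1 = 445, Q3 = 525). The function names and the list of test rents are hypothetical; the full rent data are not reproduced here.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits located 1.5 * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values that fall outside the box plot limits."""
    lower, upper = box_plot_limits(q1, q3)
    return [v for v in values if v < lower or v > upper]

lower, upper = box_plot_limits(445, 525)
print(lower, upper)  # 325.0 645.0

# Hypothetical rents: 425 and 615 lie inside the limits, 700 does not.
print(find_outliers([425, 615, 700], 445, 525))
```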

Whiskers (dashed lines) are drawn from the ends of the box
to the smallest and largest data values inside the limits.
Smallest value inside limits = 425
Largest value inside limits = 615
[Figure: box plot of the apartment rents on an axis running from 400 to 625]
Weighted Mean
When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
When data vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
\( \bar{x} = \frac{\sum w_i x_i}{\sum w_i} \)
where:
\( x_i \) = value of observation i
\( w_i \) = weight for observation i
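
The weighted mean can be illustrated with a grade point average, where the weights are credit hours. The grade points and credit hours below are hypothetical, and the function name weighted_mean is illustrative only.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

grade_points = [4, 3, 2]   # hypothetical grades (A = 4, B = 3, C = 2)
credit_hours = [3, 4, 3]   # hypothetical credit hours earned at each grade

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)  # (3*4 + 4*3 + 3*2) / 10 = 3.0
```

When every weight is equal, the weighted mean reduces to the ordinary mean.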
Mean for Grouped Data
Sample Data: \( \bar{x} = \frac{\sum f_i M_i}{n} \)
Population Data: \( \mu = \frac{\sum f_i M_i}{N} \)
where:
\( f_i \) = frequency of class i
\( M_i \) = midpoint of class i
Sample Mean for Grouped Data
\( \bar{x} = \frac{34,525}{70} = 493.21 \)
This approximation differs by $2.41 from
the actual sample mean of $490.80.
Variance for Grouped Data
For sample data: \( s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1} \)
For population data: \( \sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N} \)
Sample Variance for Grouped Data
Sample Variance: \( s^2 = 208,234.29/(70 - 1) = 3,017.89 \)
Sample Standard Deviation: \( s = \sqrt{3,017.89} = 54.94 \)
This approximation differs by only $.20
from the actual standard deviation of $54.74.
ACKNOWLEDGEMENT
1) Statistics for Management by Levin & Rubin ( Prentice Hall )

2) Business Statistics by Aczel and Soundarpardian ( Pearson )

3) Business Statistics by Anderson, Sweeney & Williams ( Cengage )

4) Applied Statistics in Business & Economics by Doane ( McGraw-Hill )
