Descriptive Statistics

Blind men and an elephant
Things aren’t always what we think!

Six blind men go to observe an elephant. One feels the side and thinks the
elephant is like a wall. One feels the tusk and thinks the elephant is like a
spear. One touches the squirming trunk and thinks the elephant is like a
snake. One feels the knee and thinks the elephant is like a tree. One
touches the ear, and thinks the elephant is like a fan. One grasps the tail
and thinks it is like a rope.
They argue long and loud and though each was partly in the
right, all were in the wrong.
What Is Statistics?
• Statistics is the methodology of extracting information from a data set.
• To do good statistical analysis, you must
– Find the right data.
– Use the appropriate statistical tools.
– Clearly translate the numerical information into
written language.
Data: Singular or Plural?
Data is the plural form of the Latin datum (a “given” fact).
Data Definitions
A Small Multivariate Data Set: 8 subjects, 5 variables (Figure 2.1)
Numerical Data
Numerical or quantitative data arise from counting or some kind
of mathematical operation.
For example,
- Number of auto insurance claims filed in
March (e.g., X = 114 claims).
- Ratio of profit to sales for last quarter
(e.g., X = 0.0447).
Numerical data can be broken down into two types: discrete or continuous.
Data Definitions
Discrete Data
A numerical variable with a countable number of values that can
be represented by an integer (no fractional values).
For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O’Hare (e.g., X = 37).
Data Definitions
Continuous Data
A numerical variable that can have any value within an interval
(e.g., length, weight, time, sales, price/earnings ratios).
Any continuous interval contains infinitely many possible values
(e.g., 426 < X < 428).
Types of Data
• Cross-sectional data
– Data collected by recording a characteristic of many
subjects at the same point in time, or without regard to
differences in time.
– Subjects might include individuals, households, firms,
industries, regions, and countries.
– The survey data from the Introductory Case is an example
of cross-sectional data.
Types of Data
• Time series data
– Data collected by recording a characteristic of a subject
over several time periods.
– Data can include daily, weekly, monthly, quarterly, or
annual observations.
– A plot of the U.S. GDP growth rate from 1980 to 2010 is
an example of time series data.
Time Series Data
Each observation in the sample represents a different equally
spaced point in time (e.g., years, months, days).
Periodicity may be annual, quarterly, monthly, weekly, daily,
hourly, etc.
We are interested in trends and patterns over time (e.g., annual
growth in consumer debit card use from 2001 to 2008).
Cross-Sectional Data
Each observation represents a different individual unit (e.g.,
person) at the same point in time (e.g., monthly VISA balances).
We are interested in
- variation among observations or in
- relationships.
We can combine the two data types to get pooled cross-sectional
and time series data.
Variables and Scales of Measurement
• A variable is the general characteristic being observed on
an object of interest.
• Types of Variables
  • Qualitative – gender, race, political affiliation
  • Quantitative – test scores, age, weight
    • Discrete
    • Continuous
Types of Quantitative Variables

– Discrete
• A discrete variable assumes a countable number of
distinct values.
• Examples: Number of children in a family, number of
points scored in a basketball game.
LO 1.4
Types of Quantitative Variables

– Continuous
• A continuous variable can assume an infinite
number of values within some interval.
• Examples: Weight, height, investment return.
Scales of Measure
Qualitative Variables
- Nominal
- Ordinal
Quantitative Variables
- Interval
- Ratio
Levels of Measurement
Level of Measurement | Characteristics | Example
Nominal | Categories only | Eye color (blue, brown, green, hazel)
Ordinal | Rank has meaning | Bond ratings (Aaa, Aab, C, D, F, etc.)
Interval | Distance has meaning | Temperature (57° Celsius)
Ratio | Meaningful zero exists | Accounts payable ($21.7 million)
Nominal Measurement
Nominal data merely identify a category.
Nominal data are qualitative, attribute, categorical, or
classification data (e.g., Small, Medium, Large, Extra Large).
Nominal data are usually coded numerically, but the codes are
arbitrary (e.g., 36 = Small, 40 = Medium, 42 = Large, 44 = Extra
Large).
The only permissible mathematical operations are counting (e.g.,
frequencies) and simple statistics such as the mode.
Ordinal Measurement
Ordinal data codes can be ranked
(e.g., 1 = Frequently, 2 = Sometimes, 3 = Rarely, 4 = Never).
The distance between codes is not meaningful (e.g., the distance
between 1 and 2, or between 2 and 3, or between 3 and 4 lacks
meaning).
Many useful statistical tests exist for ordinal data; they are
especially useful in social science, marketing, and human resource
research.
Interval Measurement
Interval data can not only be ranked but also have meaningful
intervals between scale points (e.g., the difference between 60°F
and 70°F is the same as the difference between 20°F and 30°F).
Since intervals between numbers represent distances,
mathematical operations can be performed (e.g., average).
The zero point of an interval scale is arbitrary, so ratios are not
meaningful (e.g., 60°F is not twice as warm as 30°F).
Level of Measurement
Ratio Measurement
Ratio data have all the properties of nominal, ordinal, and interval
data and also possess a meaningful zero (the absence of the
quantity being measured).
Because of this zero point, ratios of data values are meaningful
(e.g., $20 million profit is twice as much as $10 million).
Zero does not have to be observable in the data; it is an
absolute reference point.
Use the following procedure to recognize data types. Ask the
questions in order and stop at the first “Yes”.
Question | If “Yes”
Is there a meaningful zero point? | Ratio data (all statistical operations are allowed)
Are intervals between scale points meaningful? | Interval data (common statistics allowed, e.g., means and standard deviations)
Do scale points represent rankings? | Ordinal data (restricted to certain types of nonparametric statistical tests)
Are there discrete categories? | Nominal data (only counting allowed, e.g., finding the mode)
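The procedure can be sketched as a short function. This is a minimal illustration that assumes the questions are asked from top to bottom and that the first “yes” decides the level; the function and parameter names are hypothetical.

```python
def measurement_level(has_meaningful_zero: bool,
                      intervals_meaningful: bool,
                      represents_ranking: bool,
                      has_categories: bool = True) -> str:
    """Return the level of measurement implied by the first 'yes' answer."""
    if has_meaningful_zero:
        return "ratio"
    if intervals_meaningful:
        return "interval"
    if represents_ranking:
        return "ordinal"
    if has_categories:
        return "nominal"
    raise ValueError("the answers do not describe a measurable variable")


# Accounts payable in dollars: zero means "nothing owed".
print(measurement_level(True, True, True))
# Temperature in degrees Fahrenheit: equal intervals, arbitrary zero.
print(measurement_level(False, True, True))
# Eye color: categories only.
print(measurement_level(False, False, False))
```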
The Interval Scale

• Data may be categorized and ranked with respect to some
characteristic or trait.
• Differences between interval values are equal and
meaningful. Thus the arithmetic operations of addition
and subtraction are meaningful.
• No “absolute 0” or starting point defined. Meaningful
ratios may not be obtained.
The Interval Scale
– For example, consider the Fahrenheit
scale of temperature.
– This scale is interval because the data
are ranked and differences (+ or -)
may be obtained.
– But there is no “absolute 0” (What
does 0°F mean?)
The Ratio Scale
• The strongest level of measurement.
• Ratio data may be categorized and ranked with
respect to some characteristic or trait.
• Differences between ratio values are equal and
meaningful.
• There is an “absolute 0” or defined starting point.
“0” does mean “the absence of …” Thus, meaningful
ratios may be obtained.
Overview of Statistics
Statistics
• Collecting and Describing Data
– Sampling and Surveys
– Visual Displays
– Numerical Summaries
• Making Inferences from Samples
– Probability Models
– Estimating Parameters
– Testing Hypotheses
– Regression and Trends
– Quality Control
Branches of Statistics
• Two branches of statistics

– Descriptive Statistics
• collecting, organizing, and presenting the data.
– Inferential Statistics
• drawing conclusions about a population based on
sample data from that population.
LO 1.2
Population and Sample
• Population
– Consists of all items of interest.
• Sample
– A subset of the population.
• A sample statistic is calculated from the sample data
and is used to make inferences about the population
parameter.
The Need for Sampling
Reasons for sampling from the population

• Too expensive to gather information on the entire
population
• Often impossible to gather information on the entire
population
Sample or Census?
A sample involves looking only at some items selected
from the population.
A census is an examination of all items in a defined
population.
Parameters and Statistics?
• Statistics are computed from a sample of n items, chosen
from a population of N items.
• Statistics can be used as estimates of parameters found in
the population.
• Symbols are used to represent population parameters and
sample statistics.
Parameters or Statistics
Finite or Infinite?
A population is finite if it has a definite size, even if its size is
unknown.
A population is infinite if it is of arbitrarily large size.
Rule of Thumb: A population may be treated as infinite when N
is at least 20 times n (i.e., when N/n ≥ 20).
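The rule of thumb is a single comparison. A minimal sketch follows; the function name and the input checks are assumptions made for this example.

```python
def can_treat_as_infinite(population_size: int, sample_size: int) -> bool:
    """Apply the rule of thumb N/n >= 20."""
    if sample_size <= 0 or population_size < sample_size:
        raise ValueError("need 0 < n <= N")
    return population_size / sample_size >= 20


# A sample of 50 from 1,000 items gives N/n = 20, so the rule is satisfied.
print(can_treat_as_infinite(1000, 50))
# A sample of 100 from the same population gives N/n = 10.
print(can_treat_as_infinite(1000, 100))
```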
Descriptive Statistics
Numerical Description
Central Tendency
Dispersion
Numerical Description
Statistics are descriptive measures derived from a
sample (n items).
Parameters are descriptive measures derived from a
population (N items).
Central Tendency
• Central tendency is the middle or typical value
of a distribution.
• Central tendency can be assessed using a dot plot or
histogram, or more precisely with numerical
statistics.
Central Tendency
Mean
• A familiar measure of central tendency.
Population mean: \( \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \)
Sample mean: \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \)
• In Excel, use function =AVERAGE(Data) where Data is
an array of data values.
Central Tendency
Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
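A short sketch of the mean and its balancing-point property: the deviations from the mean sum to zero. The function name and the sample values are made up for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    data = list(values)
    if not data:
        raise ValueError("mean of an empty data set is undefined")
    return sum(data) / len(data)


sample = [1, 2, 3, 10]
x_bar = mean(sample)

# Every item affects the mean, and the deviations around it cancel out.
deviations = [x - x_bar for x in sample]
print(x_bar, sum(deviations))
```

Replacing the 10 with a larger value shifts the mean, which shows how a single extreme item moves it.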
Central Tendency
Median
• The median (M) is the 50th percentile or midpoint of the
sorted sample data.
• M separates the upper and lower half of the sorted
observations.
• If n is odd, the median is the middle observation in the data
array.
• If n is even, the median is the average of the middle two
observations in the data array.
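The odd and even cases can be sketched directly from the sorted data. The function name is chosen for this example only.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                     # odd n: the single middle observation
    return (data[mid - 1] + data[mid]) / 2   # even n: average the middle two


print(median([5, 1, 3]))      # odd n
print(median([4, 1, 3, 2]))   # even n
```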
Central Tendency
Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur
often near the center of sorted data.
• May have multiple modes or no mode.
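A minimal sketch of finding the mode by counting. It assumes one convention that the slides do not spell out: when no value occurs more than once, the data set is reported as having no mode.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []          # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))      # one mode
print(modes([1, 1, 2, 2, 3]))   # multiple modes
print(modes([1, 2, 3]))         # no mode
```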
Central Tendency
Mode
• A bimodal distribution refers to the shape of the histogram
rather than the mode of the raw data.
• Occurs when dissimilar populations are combined in one
sample.
Central Tendency
Skewness
Compare mean and median or look at histogram
to determine degree of skewness.
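The comparison of mean and median can be sketched as below. The direction labels follow the usual rule of thumb (a mean above the median suggests a longer right tail), which is an assumption added here; it indicates direction only and should be confirmed with a histogram.

```python
from statistics import mean, median


def skew_direction(values):
    """Rule-of-thumb skewness direction from comparing mean and median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"


print(skew_direction([1, 2, 3, 4, 100]))     # one large value pulls the mean up
print(skew_direction([1, 97, 98, 99, 100]))  # one small value pulls the mean down
print(skew_direction([1, 2, 3, 4, 5]))
```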
Dispersion
Variation is the “spread” of data points about the center of
the distribution in a sample. Consider the following
measures of dispersion:
Measures of Variation
Statistic | Formula | Excel | Pro | Con
Range | \( x_{\max} - x_{\min} \) | =MAX(Data)-MIN(Data) | Easy to calculate. | Sensitive to extreme data values.
Variance (\( s^2 \)) | \( \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \) | =VAR(Data) | Plays a key role in mathematical statistics. | Non-intuitive meaning.
Standard deviation (\( s \)) | \( \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} \) | =STDEV(Data) | Most common measure. Uses same units as the raw data ($, £, Rs, etc.). | Non-intuitive meaning.
Coefficient of variation (CV) | \( 100 \times \frac{s}{\bar{x}} \) | None | Measures relative variation in percent so can compare data sets. | Requires non-negative data.
Mean absolute deviation (MAD) | \( \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert \) | =AVEDEV(Data) | Easy to understand. | Lacks “nice” theoretical properties.
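The five measures can be computed together. The sketch below follows the formulas as these slides write them, dividing by n; the function name, the dictionary keys, and the sample values are made up for illustration.

```python
import math


def dispersion_summary(values):
    """Range, variance, standard deviation, CV (%) and MAD, all using divisor n."""
    data = list(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one observation")
    x_bar = sum(data) / n
    variance = sum((x - x_bar) ** 2 for x in data) / n
    std_dev = math.sqrt(variance)
    return {
        "range": max(data) - min(data),
        "variance": variance,
        "std_dev": std_dev,
        # The CV is only reported for a positive mean.
        "cv_percent": 100 * std_dev / x_bar if x_bar > 0 else None,
        "mad": sum(abs(x - x_bar) for x in data) / n,
    }


summary = dispersion_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```

For this sample the mean is 5, so the standard deviation of 2 corresponds to a CV of 40 percent.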
Dispersion
Variance
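• The population variance (\( \sigma^2 \)) is defined as the sum of squared
deviations around the mean \( \mu \) divided by the population size:
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \]
• The sample variance (\( s^2 \)) is defined as the sum of squared
deviations around the mean divided by the sample size:
\[ s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Note that Excel's =VAR(Data) and =STDEV(Data) divide by n − 1 rather
than n; =VARP(Data) and =STDEVP(Data) divide by the number of items.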
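The definition above divides by the sample size n. Many textbooks and software packages divide by n − 1 instead. The sketch below compares the two divisors on the same data; the `ddof` parameter name is borrowed from common numerical libraries and is an assumption of this example.

```python
def variance(values, ddof=0):
    """Sum of squared deviations around the mean, divided by (n - ddof).

    ddof=0 reproduces the divide-by-n definition used in these slides;
    ddof=1 gives the divide-by-(n - 1) version.
    """
    data = list(values)
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("not enough observations for this divisor")
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - ddof)


data = [2, 4, 4, 4, 5, 5, 7, 9]
var_n = variance(data)                  # divisor n
var_n_minus_1 = variance(data, ddof=1)  # divisor n - 1
print(var_n, var_n_minus_1)
```

The gap between the two results shrinks as n grows, so the choice of divisor matters most for small samples.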
Dispersion
Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary from
the mean.
• Units of measure are the same as X.
Population standard deviation: \( \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \)
Sample standard deviation: \( s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} \)
Descriptive Statistics
Standardized Data
Percentiles, Quartiles and Box Plots
Standardized Data
Chebyshev’s Theorem
• Developed by mathematicians Jules Bienaymé
(1796-1878) and Pafnuty Chebyshev (1821-1894).
• For any population with mean \( \mu \) and standard
deviation \( \sigma \), the percentage of observations that lie
within k standard deviations of the mean must be at
least \( 100[1 - 1/k^2] \).
Standardized Data
Chebyshev’s Theorem
• For k = 2 standard deviations,
\( 100[1 - 1/2^2] = 75\% \)
• So, at least 75.0% will lie within \( \mu \pm 2\sigma \)
• For k = 3 standard deviations,
\( 100[1 - 1/3^2] = 88.9\% \)
• So, at least 88.9% will lie within \( \mu \pm 3\sigma \)
• Although applicable to any data set, these limits tend to be
too wide to be useful.
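The bound is easy to compute and to compare against a data set. In this sketch the function names and the deliberately skewed sample are made up for illustration, and the bound is only evaluated for k > 1, where it is informative.

```python
import math


def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)


def percent_within(values, k: float) -> float:
    """Observed percentage of values within k population standard deviations."""
    data = list(values)
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return 100 * inside / n


skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]
print(chebyshev_lower_bound(2), percent_within(skewed, 2))
```

Even for this very skewed sample the observed share is above the guaranteed minimum, which illustrates why the limits are often wider than needed.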
Standardized Data
The Empirical Rule
• The normal or Gaussian distribution was named for Carl Friedrich Gauss
(1777-1855).
• The normal distribution is symmetric and is also known as the
bell-shaped curve.
• The Empirical Rule states that for data from a normal
distribution, we expect that for
k = 1 about 68.26% will lie within \( \mu \pm 1\sigma \)
k = 2 about 95.44% will lie within \( \mu \pm 2\sigma \)
k = 3 about 99.73% will lie within \( \mu \pm 3\sigma \)
Standardized Data
The Empirical Rule
• Distance from the mean is measured in terms of the
number of standard deviations.
• Note: no upper bound is given. Data values outside
\( \mu \pm 3\sigma \) are rare.
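The Empirical Rule percentages can be reproduced from the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name is an assumption. Computed values may differ in the last digit from figures taken from rounded printed tables.

```python
from statistics import NormalDist


def normal_percent_within(k: float) -> float:
    """Percentage of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return 100 * (z.cdf(k) - z.cdf(-k))


for k in (1, 2, 3):
    print(k, round(normal_percent_within(k), 2))
```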
Standardized Data
Defining a Standardized Variable

A standardized variable (Z) redefines each observation
in terms of the number of standard deviations from the
mean.
Standardization formula for a population: \( z_i = \frac{x_i - \mu}{\sigma} \)
Standardization formula for a sample: \( z_i = \frac{x_i - \bar{x}}{s} \)
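A minimal sketch of standardizing a sample. It assumes the sample standard deviation is computed with divisor n, as these slides define it; the function name and the data are made up for illustration.

```python
import math


def standardize(values):
    """Return z-scores: each value's distance from the mean in standard deviations."""
    data = list(values)
    n = len(data)
    x_bar = sum(data) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / n)  # divisor n (assumed)
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - x_bar) / s for x in data]


z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```

A z-score of 2 for the value 9 says that it lies two standard deviations above the mean of 5.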
Percentiles
• Percentiles divide the sorted data into 100 groups. For example,
if you score in the 83rd percentile on a standardized test,
83% of the test-takers scored below you.
• Deciles divide the sorted data into 10 groups.
• Quintiles divide the sorted data into 5 groups.
• Quartiles divide the sorted data into 4 groups.
Quartiles
Quartiles are scale points that divide the sorted data into
four groups of approximately equal size.
Q1 Q2 Q3
Lower 25% | Second 25% | Third 25% | Upper 25%
The three values that separate the four groups are called Q1, Q2,
and Q3, respectively.
Quartiles
The first quartile Q1 is the median of the data values
below Q2, and the third quartile Q3 is the median of the
data values above Q2.
Q1 Q2 Q3
Lower 25% | Second 25% | Third 25% | Upper 25%
For the first half of the data, 50% lie above and 50% below Q1.
For the second half of the data, 50% lie above and 50% below Q3.
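The method of medians described above can be sketched as follows. One assumption is made explicit: when n is odd, the middle observation (which equals Q2) is left out of both halves. Other quartile conventions exist and can give slightly different values.

```python
def _median(sorted_data):
    """Median of data that is already sorted."""
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 1:
        return sorted_data[mid]
    return (sorted_data[mid - 1] + sorted_data[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    lower = data[: n // 2]          # values below the Q2 position
    upper = data[(n + 1) // 2:]     # values above the Q2 position
    return _median(lower), _median(data), _median(upper)


print(quartiles([1, 2, 3, 4, 5, 6, 7]))
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```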
Correlation
Correlation Coefficient
The sample correlation coefficient is a statistic that describes
the degree of linearity between paired observations on two
quantitative variables X and Y.
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
Correlation
Correlation Coefficient
Its range is -1 ≤ r ≤ +1.
In Excel, use the function =CORREL(Xdata, Ydata).
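The sample correlation coefficient can be sketched in a few lines. The function name and the test data are made up for illustration. The denominator is computed as one square root of the product of the two sums of squares, which is algebraically the same as the product of the two square roots.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])     # perfectly linear, increasing
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly linear, decreasing
print(r_up, r_down)
```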
Correlation
Illustration of Correlation Coefficients
