Descriptive Statistics


Blind men and an elephant

Things aren’t always what we think!


Six blind men go to observe an elephant. One feels the side and thinks the
elephant is like a wall. One feels the tusk and thinks the elephant is like a
spear. One touches the squirming trunk and thinks the elephant is like a
snake. One feels the knee and thinks the elephant is like a tree. One
touches the ear, and thinks the elephant is like a fan. One grasps the tail
and thinks it is like a rope.

They argue long and loud and though each was partly in the
right, all were in the wrong.
What Is Statistics?

• Statistics is the methodology of extracting information from a data set.
• To do good statistical analysis, you must
– Find the right data.
– Use the appropriate statistical tools.
– Clearly communicate the numerical information in written language.
Data: Singular or Plural?

Data is the plural form of the Latin datum (a “given” fact).
Data Definitions

A Small Multivariate Data Set


8 Subjects 5 Variables
Data Definitions

(Figure 2.1)
Data Definitions
Numerical Data
Numerical or quantitative data arise from counting or some kind
of mathematical operation.
For example,
- Number of auto insurance claims filed in
March (e.g., X = 114 claims).
- Ratio of profit to sales for last quarter
(e.g., X = 0.0447).

Numerical data can be broken down into two types: discrete or continuous data.
Data Definitions
Discrete Data
A numerical variable with a countable number of values that can
be represented by an integer (no fractional values).
For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O’Hare (e.g., X = 37).
Data Definitions

Continuous Data

A numerical variable that can have any value within an interval
(e.g., length, weight, time, sales, price/earnings ratios).

Any continuous interval contains infinitely many possible values
(e.g., 426 < X < 428).
Types of Data

Cross-sectional data
– Data collected by recording a characteristic of many
subjects at the same point in time, or without regard to
differences in time.
– Subjects might include individuals, households, firms,
industries, regions, and countries.
– The survey data from the Introductory Case is an example
of cross-sectional data.
Types of Data
• Time series data
– Data collected by recording a characteristic of a subject
over several time periods.
– Data can include daily, weekly, monthly, quarterly, or
annual observations.
– This graph plots the U.S. GDP growth rate from 1980 to 2010; it is an
example of time series data.
Time Series Data

Each observation in the sample represents a different equally spaced
point in time (e.g., years, months, days).
Periodicity may be annual, quarterly, monthly, weekly, daily,
hourly, etc.
We are interested in trends and patterns over time (e.g., annual
growth in consumer debit card use from 2001 to 2008).
Cross-Sectional Data
Each observation represents a different individual unit (e.g.,
person) at the same point in time (e.g., monthly VISA balances).

We are interested in
- variation among observations or in
- relationships.

We can combine the two data types to get pooled cross-sectional and
time series data.
Variables and Scales of Measurement

• A variable is the general characteristic being observed on an object
of interest.
• Types of Variables
• Qualitative – gender, race, political affiliation
• Quantitative – test scores, age, weight
• Discrete
• Continuous
Variables and Scales of Measurement

Types of Quantitative Variables


– Discrete
• A discrete variable assumes a countable number of
distinct values.
• Examples: Number of children in a family, number of
points scored in a basketball game.

LO 1.4
Variables and Scales of Measurement

Types of Quantitative Variables


Continuous
• A continuous variable can assume an infinite
number of values within some interval.
• Examples: Weight, height, investment return.

Variables and Scales of Measurement

Scales of Measure

Qualitative Variables
- Nominal
- Ordinal

Quantitative Variables
- Interval
- Ratio

Levels of Measurements

Level of Measurement | Characteristics | Example
Nominal | Categories only | Eye color (blue, brown, green, hazel)
Ordinal | Rank has meaning | Bond ratings (Aaa, Aab, C, D, F, etc.)
Interval | Distance has meaning | Temperature (57° Celsius)
Ratio | Meaningful zero exists | Accounts payable ($21.7 million)
Levels of Measurements
Nominal Measurement
Nominal data merely identify a category.
Nominal data are qualitative, attribute, categorical or
classification data (e.g., Small, Medium, Large, Extra Large, etc.).
Nominal data are usually coded numerically, but the codes are
arbitrary (e.g., 36 = Small, 40 = Medium, 42 = Large, 44 = Extra
Large).
The only permissible mathematical operations are counting (e.g.,
frequencies) and simple statistics.
Levels of Measurements

Ordinal Measurement
Ordinal data codes can be ranked
(e.g., 1 = Frequently, 2 = Sometimes, 3 = Rarely, 4 = Never).

Distance between codes is not meaningful
(e.g., distance between 1 and 2, or between 2 and 3, or between
3 and 4 lacks meaning).
Many useful statistical tests exist for ordinal data. Ordinal data are
especially useful in social science, marketing and human resource
research.
Levels of Measurements

Interval Measurement
Data can not only be ranked, but also have meaningful intervals
between scale points
(e.g., the difference between 60°F and 70°F is the same as the
difference between 20°F and 30°F).
Since intervals between numbers represent distances,
mathematical operations can be performed (e.g., average).

The zero point of interval scales is arbitrary, so ratios are not
meaningful (e.g., 60°F is not twice as warm as 30°F).
Level of Measurement
Ratio Measurement
Ratio data have all properties of nominal, ordinal and interval
data types and also possess a meaningful zero (absence of
quantity being measured).

Because of this zero point, ratios of data values are meaningful
(e.g., $20 million profit is twice as much as $10 million).

Zero does not have to be observable in the data; it is an absolute
reference point.
Levels of Measurements

Use the following procedure to recognize data types. Ask the questions
in order and stop at the first “Yes”.

Question | If “Yes”
Is there a meaningful zero point? | Ratio data (all statistical operations are allowed)
Are intervals between scale points meaningful? | Interval data (common statistics allowed, e.g., means and standard deviations)
Do scale points represent rankings? | Ordinal data (restricted to certain types of nonparametric statistical tests)
Are there discrete categories? | Nominal data (only counting allowed, e.g., finding the mode)
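The procedure above can be sketched as a short decision function. This is a minimal illustration, not a standard routine: the function name and its four yes/no parameters are hypothetical, and the questions are assumed to be asked in the order listed, stopping at the first "yes".

```python
def measurement_level(meaningful_zero: bool, meaningful_intervals: bool,
                      ranked: bool, categories: bool) -> str:
    """Return the measurement level by asking the four questions in order."""
    if meaningful_zero:
        return "ratio"
    if meaningful_intervals:
        return "interval"
    if ranked:
        return "ordinal"
    if categories:
        return "nominal"
    raise ValueError("the answers do not match any measurement level")

# Examples taken from the slides.
print(measurement_level(True, True, True, True))      # accounts payable
print(measurement_level(False, True, True, True))     # temperature in Fahrenheit
print(measurement_level(False, False, True, True))    # bond ratings
print(measurement_level(False, False, False, True))   # eye color
```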
Variables and Scales of Measurement

The Interval Scale


• Data may be categorized and ranked with respect to some
characteristic or trait.
• Differences between interval values are equal and
meaningful. Thus the arithmetic operations of addition
and subtraction are meaningful.
• No “absolute 0” or starting point defined. Meaningful
ratios may not be obtained.

Variables and Scales of Measurement

• The Interval Scale


– For example, consider the Fahrenheit scale of temperature.
– This scale is interval because the data are ranked and differences
(+ or -) may be obtained.
– But there is no “absolute 0” (What does 0°F mean?)

Variables and Scales of Measurement
The Ratio Scale
• The strongest level of measurement.
• Ratio data may be categorized and ranked with
respect to some characteristic or trait.
• Differences between interval values are equal and
meaningful.
• There is an “absolute 0” or defined starting point.
“0” does mean “the absence of …” Thus, meaningful
ratios may be obtained.

Overview of Statistics
Statistics

Collecting and Describing Data
Sampling and Surveys | Visual Displays | Numerical Summaries

Making Inferences from Samples
Probability Models | Estimating Parameters | Testing Hypotheses | Regression and Trends | Quality Control
Branches of Statistics?

• Two branches of statistics


– Descriptive Statistics
• collecting, organizing, and presenting the data.
– Inferential Statistics
• drawing conclusions about a population based on
sample data from that population.

LO 1.2
Population and Sample
• Population
– Consists of all items of interest.
• Sample
– A subset of the population.
• A sample statistic is calculated from the sample data
and is used to make inferences about the population
parameter.

The Need for Sampling

Reasons for sampling from the population


• Too expensive to gather information on the entire
population
• Often impossible to gather information on the entire
population
Sample or Census?

A sample involves looking only at some items selected from the
population.

A census is an examination of all items in a defined population.
Parameters and Statistics?
• Statistics are computed from a sample of n items, chosen
from a population of N items.
• Statistics can be used as estimates of parameters found in
the population.
• Symbols are used to represent population parameters and
sample statistics.
Parameters or Statistics
Finite or Infinite?
A population is finite if it has a definite size, even if its size is
unknown.
A population is infinite if it is of arbitrarily large size.
Rule of Thumb: A population may be treated as infinite when N
is at least 20 times n (i.e., when N/n ≥ 20).
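The rule of thumb is a single comparison. A minimal sketch, in which the function name and the example sizes are hypothetical:

```python
def treat_as_infinite(population_size: int, sample_size: int,
                      threshold: float = 20.0) -> bool:
    """Apply the rule of thumb N/n >= 20."""
    return population_size / sample_size >= threshold

print(treat_as_infinite(2000, 100))   # True:  N/n = 20
print(treat_as_infinite(1000, 100))   # False: N/n = 10
```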
Descriptive Statistics

Numerical Description
Central Tendency
Dispersion
Numerical Description
Statistics are descriptive measures derived from a
sample (n items).
Parameters are descriptive measures derived from a
population (N items).

Central Tendency

• The central tendency is the middle or typical value of a
distribution.

• Central tendency can be assessed using a dot plot or histogram, or
more precisely with numerical statistics.
Central Tendency

Mean
• A familiar measure of central tendency.

Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

• In Excel, use function =AVERAGE(Data) where Data is an array of data
values.
Central Tendency
Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
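As a rough illustration of the mean as a balancing point, the sketch below computes the sample mean of a small, made-up data set and checks that the deviations around it sum to zero. The data values and the function name are hypothetical.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

data = [2, 4, 4, 5, 10]            # hypothetical sample
xbar = sample_mean(data)
deviations = [x - xbar for x in data]

print(xbar)                        # 5.0
print(sum(deviations))             # 0.0: negative and positive deviations cancel
```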
Central Tendency
Median
• The median (M) is the 50th percentile or midpoint of the
sorted sample data.
• M separates the upper and lower half of the sorted
observations.

• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two
observations in the data array.
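The odd and even cases can be sketched directly. This is a minimal illustration with a hypothetical function name and made-up data.

```python
def median(values):
    """Middle value of the sorted data; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 3]))       # 3   (n odd: middle observation)
print(median([7, 1, 3, 5]))    # 4.0 (n even: average of 3 and 5)
```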
Central Tendency
Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur
often near the center of sorted data.
• May have multiple modes or no mode.
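A small sketch of finding the mode, including the cases of several modes and no mode. It assumes, as a convention chosen for this example, that a data set in which every value occurs exactly once has no mode; the function name is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []                  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))         # [2]
print(modes([1, 1, 2, 2, 3]))      # [1, 2]  (two modes)
print(modes([1, 2, 3]))            # []      (no mode)
```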
Central Tendency
Mode
• A bimodal distribution refers to the shape of the histogram
rather than the mode of the raw data.
• Occurs when dissimilar populations are combined in one
sample. For example,
Central Tendency
Skewness
Compare mean and median or look at histogram
to determine degree of skewness.
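One way to compare mean and median is sketched below. It relies on a common heuristic that is assumed here rather than stated on the slide: a long right tail tends to pull the mean above the median, and a long left tail pulls it below. The heuristic can fail for some data sets, so a histogram remains the better check. The function name and data are hypothetical.

```python
from statistics import mean, median

def skew_direction(values):
    """Heuristic only: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"

print(skew_direction([1, 2, 3, 4, 20]))      # right-skewed (mean 6, median 3)
print(skew_direction([1, 17, 18, 19, 20]))   # left-skewed  (mean 15, median 18)
print(skew_direction([1, 2, 3, 4, 5]))       # roughly symmetric
```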
Dispersion
Variation is the “spread” of data points about the center of
the distribution in a sample. Consider the following
measures of dispersion:
Measures of Variation
Statistic | Formula | Excel | Pro | Con
Range | $x_{max} - x_{min}$ | =MAX(Data)-MIN(Data) | Easy to calculate. | Sensitive to extreme data values.
Variance ($s^2$) | $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | =VAR(Data) | Plays a key role in mathematical statistics. | Non-intuitive meaning.
Dispersion
Measures of Variation
Statistic | Formula | Excel | Pro | Con
Standard deviation (s) | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ | =STDEV(Data) | Most common measure. Uses same units as the raw data ($, £, Rs, etc.). | Non-intuitive meaning.
Coefficient of variation (CV) | $100 \times \frac{s}{\bar{x}}$ | None | Measures relative variation in percent so can compare data sets. | Requires non-negative data.
Dispersion

Measures of Variation
Statistic | Formula | Excel | Pro | Con
Mean absolute deviation (MAD) | $\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|$ | =AVEDEV(Data) | Easy to understand. | Lacks “nice” theoretical properties.
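The five measures of variation can be computed side by side. The sketch below is illustrative and uses made-up data. The `ddof` parameter is an assumption added for this example: `ddof=0` divides the sum of squares by n, as the formulas on these slides do, while `ddof=1` divides by n - 1, which is what Excel's =VAR and =STDEV compute.

```python
import math

def dispersion(values, ddof=0):
    """Range, variance, standard deviation, CV (percent) and MAD of a sample."""
    n = len(values)
    xbar = sum(values) / n
    sum_sq = sum((x - xbar) ** 2 for x in values)
    variance = sum_sq / (n - ddof)
    std_dev = math.sqrt(variance)
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "std_dev": std_dev,
        "cv": 100 * std_dev / xbar,                      # needs a nonzero mean
        "mad": sum(abs(x - xbar) for x in values) / n,
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical sample, mean = 5
summary = dispersion(data)
print(summary["range"])            # 7
print(summary["variance"])         # 4.0
print(summary["std_dev"])          # 2.0
print(summary["cv"])               # 40.0
print(summary["mad"])              # 1.5
```

Calling `dispersion(data, ddof=1)` gives a variance of 32/7, slightly larger than 4.0, because the same sum of squares is divided by 7 instead of 8.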
Dispersion
Variance
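• The population variance ($\sigma^2$) is defined as the sum of squared
deviations around the mean $\mu$ divided by the population size:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

• The sample variance ($s^2$) is defined as the sum of squared
deviations around the mean $\bar{x}$ divided by the sample size:
$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$

• Note: many textbooks, and Excel’s =VAR(Data) and =STDEV(Data),
divide by n – 1 instead of n when working with a sample.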
Dispersion
Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary from
the mean.
• Units of measure are the same as X.

Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

Sample standard deviation: $s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Descriptive Statistics

Standardized Data
Percentiles, Quartiles and Box Plots
Standardized Data
Chebyshev’s Theorem
• Developed by mathematicians Jules Bienaymé
(1796-1878) and Pafnuty Chebyshev (1821-1894).

• For any population with mean $\mu$ and standard deviation $\sigma$,
the percentage of observations that lie within k standard deviations
of the mean must be at least $100[1 - 1/k^2]$.
Standardized Data
Chebyshev’s Theorem
• For k = 2 standard deviations,
$100[1 - 1/2^2] = 75\%$
• So, at least 75.0% will lie within $\mu \pm 2\sigma$
• For k = 3 standard deviations,
$100[1 - 1/3^2] = 88.9\%$
• So, at least 88.9% will lie within $\mu \pm 3\sigma$

• Although applicable to any data set, these limits tend to be too wide
to be useful.
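The two worked values can be reproduced with a short function. A minimal sketch with a hypothetical function name; it clamps the result at zero because the bound says nothing useful for k of 1 or less.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 100 * (1 - 1 / k ** 2))

print(chebyshev_min_percent(2))              # 75.0
print(round(chebyshev_min_percent(3), 1))    # 88.9
```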
Standardized Data
The Empirical Rule
• The normal or Gaussian distribution was named for Carl Friedrich
Gauss (1777-1855).
• The normal distribution is symmetric and is also known as the
bell-shaped curve.
• The Empirical Rule states that for data from a normal
distribution, we expect that for
k = 1 about 68.26% will lie within $\mu \pm 1\sigma$
k = 2 about 95.44% will lie within $\mu \pm 2\sigma$
k = 3 about 99.73% will lie within $\mu \pm 3\sigma$
Standardized Data
The Empirical Rule
• Distance from the mean is measured in terms of the
number of standard deviations.
Note: no upper bound is given. Data values outside $\mu \pm 3\sigma$
are rare.
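The Empirical Rule percentages can be checked against the normal distribution itself. The sketch below assumes the standard identity P(|Z| < k) = erf(k/√2) for a standard normal Z, which is not stated on the slides; the function name is hypothetical. The computed values differ from the figures quoted above only in the second decimal place, which is consistent with table rounding.

```python
import math

def normal_percent_within(k: float) -> float:
    """Percentage of a normal population within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_percent_within(k), 2))
```

Comparing with Chebyshev's bound for k = 2 (at least 75%) shows how much tighter the statement becomes once normality is assumed.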
Standardized Data

Defining a Standardized Variable


A standardized variable (Z) redefines each observation in terms of
the number of standard deviations from the mean.

Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$

Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$
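A short sketch of standardizing a sample, using made-up data. The `ddof` parameter is an assumption of this example: `ddof=0` computes the standard deviation with a divisor of n, and `ddof=1` uses n - 1.

```python
import math

def standardize(values, ddof=0):
    """Return the z-score of each observation: (x - mean) / standard deviation."""
    n = len(values)
    xbar = sum(values) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - ddof))
    return [(x - xbar) / s for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical sample: mean 5, std dev 2
z_scores = standardize(data)
print(z_scores)   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The last observation, 9, lies two standard deviations above the mean, so its z-score is 2.0.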
Percentiles, Quartiles and Box Plots
Percentiles
• Percentiles are data that have been divided into 100
groups. For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the test-takers
scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4 groups.
Percentiles, Quartiles and Box Plots
Quartiles
Quartiles are scale points that divide the sorted data into
four groups of approximately equal size.

Q1 Q2 Q3

Lower 25% | Second 25% | Third 25% | Upper 25%

The three values that separate the four groups are called Q1, Q2,
and Q3, respectively.
Percentiles, Quartiles and Box Plots
Quartiles
The first quartile Q1 is the median of the data values
below Q2, and the third quartile Q3 is the median of the
data values above Q2.

Q1 Q2 Q3
Lower 25% | Second 25% | Third 25% | Upper 25%

For the first half of the data, 50% lie above and 50% below Q1.
For the second half of the data, 50% lie above and 50% below Q3.
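The median-of-halves definition can be sketched as follows. Two choices here are assumptions of this example: the data are split by position in the sorted array, and when n is odd the middle observation is left out of both halves. Spreadsheet and library quartile functions often interpolate instead and may give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]             # observations below Q2
    upper = ordered[half + n % 2:]     # observations above Q2
    return median(lower), median(ordered), median(upper)

print(quartiles([1, 3, 5, 7, 9, 11, 13, 15]))   # (4.0, 8.0, 12.0)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))         # (2, 4, 6)
```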
Correlation

Correlation Coefficient
The sample correlation coefficient is a statistic that describes
the degree of linearity between paired observations on two
quantitative variables X and Y.
$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$
Correlation

Correlation Coefficient
Its range is -1 ≤ r ≤ +1.
In Excel, use function =CORREL(Xdata, Ydata).
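The correlation formula can be followed term by term in code. A minimal sketch with made-up paired data and a hypothetical function name; it raises an error if either variable has no variation, since r is then undefined.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(correlation(x, [2, 4, 6, 8, 10]))   # 1.0  (perfect positive linear relation)
print(correlation(x, [10, 8, 6, 4, 2]))   # -1.0 (perfect negative linear relation)
```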
Correlation
Illustration of Correlation Coefficients
