Statistics and Data Analysis: Professor William Greene Stern School of Business IOMS Department Department of Economics

Part 2 – Descriptive Statistics

Summarizing data with useful
statistics

Use random samples and basic descriptive statistics.
What is the ‘breach rate’ in a pool of tens of thousands of mortgages? (‘Breach’ = improperly underwritten or serviced or otherwise faulty mortgage.)

The forensic analysis was an examination of
statistics from a random sample of 1,500 loans.

Descriptive Statistics
Agenda
• Populations and Random Samples
• Descriptive Statistics for a Variable
    • Measures of location: Mean, median, mode
    • Measure of dispersion: Standard deviation
• Measuring Correlation of Two Variables
    • Understanding correlation
    • Measuring correlation
    • Scatter plots and regression

Populations and Samples
• Population: Collection of all possible observations (data points) on a variable
• Sample: A subset of the data points in the population
• Random sample: Defined by the way the sample data are obtained. All points in the population are equally likely to be drawn in any particular sample.
• What is the purpose of obtaining a sample? To describe or learn about the population.
    • The sample is observed.
    • The population is assumed.
• In order to learn confidently about the population from a sample, the sample must be ‘random.’

Random Sampling
• A production process produces circuit boards. Boards are produced in each hour with an average of 2 defects per board when the process is in control. Each hour, the engineer examines a random sample of 100 circuit boards. The average number of defects per board in a particular 30 hour week is
    Hour 1: Mean of 100 boards = 1.95,
    Hour 2: Mean of 100 boards = 2.65,
    Hour 3: Mean of 100 boards = 1.80, …
    Hour 30: Mean of 100 boards = 2.35.
  (These are estimates of the defect rate per board.)
• The objective of drawing the sample is to determine whether the process is in control or not. (The process is under control if the defect rate is < 2.)
• Method: Assuming the process is in control, would we expect to see this rate of defects?

Random samples of behavior are difficult
to obtain, especially by telephone.

Nonrandom Samples
Nonrandom samples produce tainted, sometimes not believable results
• Biased with respect to the population
• May describe a not useful specific subset of the population.

(Non)Randomness of Samples
Sources of bias in samples (generally related)
• Bad sample design – e.g., home phone surveys conducted during working hours
• Survey (non)response bias – e.g., opinion surveys about service quality
• Participation bias – e.g., voluntary participation in a survey
• Self selection – volunteering for a trial or an opinion sample. (Shere Hite’s cultural revolution)
• Attrition bias from clinical trials - e.g., if the drug works, the subject does not come back.

Nonrandom results in incubator funds. The “NYU No Action Letter”

Nonscientific, Nonrandom “(non)Sampling”
A Cultural Revolution …
“3000 women, ages 14
to 78 describe in their
own words …”

http://www.amazon.com/The-Hite-Report-National-Sexuality/dp/1583225692
http://en.wikipedia.org/wiki/Shere_Hite

The Lesson…
Having a really big sample does not assure you of an accurate result. It may assure you of a really solid, really bad (inaccurate) result.
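The lesson can be illustrated with a small simulation. This is only a sketch on made-up data: the population of scores, the self-selection rule in `responds`, and the sample sizes are all assumptions chosen for illustration and are not taken from any survey mentioned in these slides.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 100,000 scores centered near 50.
population = [random.gauss(50, 10) for _ in range(100_000)]
pop_mean = statistics.fmean(population)

# A small but genuinely random sample.
random_sample = random.sample(population, 500)


def responds(score):
    """Assumed self-selection rule: higher scores are more likely to respond."""
    p = min(1.0, max(0.0, (score - 30) / 60))
    return random.random() < p


# A very large, self-selected (nonrandom) sample.
self_selected = [x for x in population if responds(x)]

random_error = abs(statistics.fmean(random_sample) - pop_mean)
biased_error = abs(statistics.fmean(self_selected) - pop_mean)

print(f"population mean = {pop_mean:.2f}")
print(f"random sample,  n = {len(random_sample)}: error = {random_error:.2f}")
print(f"self-selected,  n = {len(self_selected)}: error = {biased_error:.2f}")
```

Under these assumptions the self-selected sample is many times larger than the random one, yet its mean sits farther from the population mean, because extra observations do not remove the selection bias.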

How do ASCAP, BMI and SESAC allocate the royalty pool to specific authors and
publishers? The following relates to terrestrial radio, which, as a group, pays a lump sum
into the pool, which is then allocated by the PRSs.
http://old.cni.org/docs/ima.ip-workshop/Massarsky.html
A Descriptive Statistic
• Is … ?
• Describes what?
    • The sample data
    • The population that the data came from

Measures of Location
These are 30 hours of average defect data on sets of
circuit boards. Roughly what is the typical value?
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
• Location and central tendency
    • There exists a distribution of values
    • We are interested in the “center” of the distribution
• Two measures are the sample mean and the sample median
    • They look similar, and measure the same thing.
    • They differ systematically (and predictably) when the data are not ‘symmetric.’

The Sample Mean
These are 30 hours of average defect data on sets of circuit
boards. Roughly what is the typical value?
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
There are N observations (data points) in the sample.

Sample data: $y = [y_1, y_2, y_3, y_4, \ldots, y_N]$
In this sample, N = 30. The sample mean is
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\left[y_1 + y_2 + y_3 + y_4 + \cdots + y_N\right] = \frac{1}{30}(1.45 + \cdots + 2.35) = \frac{56.30}{30} = 1.8767$$
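The same computation can be checked in a few lines of Python. This is a minimal sketch; the variable names are illustrative, and the data are the 30 hourly averages listed above.

```python
# The 30 hourly average defect counts from the slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

n = len(defects)    # number of observations, N
total = sum(defects)  # y1 + y2 + ... + yN
mean = total / n    # sample mean

print(f"N = {n}, sum = {total:.2f}, mean = {mean:.4f}")
```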

It may be necessary to ‘weight’ aggregate data.
Average Home Listings
$$\overline{\text{Listing}} = \frac{1}{51}(896{,}800 + 713{,}864 + \cdots + 164{,}326) = 369{,}687$$

Averaging Averages?
• Hawaii’s average listing = $896,800
• Hawaii’s population = 1,275,194
• Illinois’ average listing = $377,683
• Illinois’ population = 12,763,371
• Illinois and Hawaii each get weight 1/51 = .019607 when the mean is computed.
• Looks like Hawaii is getting too much influence.

A Properly Weighted Average
Simple average: $\overline{\text{Listing}} = \sum_{\text{States}} \text{Weight}_{\text{State}} \times \text{Listing}_{\text{State}}$, with $\text{Weight}_{\text{State}} = \frac{1}{51} = .019607$ for every state.
Illinois is 10 times as big as Hawaii. Suppose we use weights that are in proportion to the state's population. (The weights sum to 1.0.)
$\text{Weight}_{\text{State}}$ varies from .001717 for Wyoming to .121899 for California.
The new average is 409,234 compared to 369,687 without weights, an error of 11%. Sometimes an unequal weighting of the observations is necessary.
State populations from http://www.factmonster.com/ipka/A0004986.html
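The effect of population weighting can be sketched with just the two states whose figures appear on the earlier slide. The full 51-row table is not reproduced here, so this example does not recover the 409,234 figure; it only shows the mechanics, and the dictionary names are illustrative.

```python
# Figures quoted on the slides for two of the 51 states.
listings = {"Hawaii": 896_800, "Illinois": 377_683}
populations = {"Hawaii": 1_275_194, "Illinois": 12_763_371}

states = list(listings)

# Simple average: every state gets the same weight.
simple_average = sum(listings[s] for s in states) / len(states)

# Population weights, constructed so that they sum to 1.0.
total_pop = sum(populations[s] for s in states)
weights = {s: populations[s] / total_pop for s in states}

# Weighted average: sum over states of weight * listing.
weighted_average = sum(weights[s] * listings[s] for s in states)

print(f"simple average   = {simple_average:,.0f}")
print(f"weighted average = {weighted_average:,.0f}")
```

With equal weights Hawaii pulls the average far above the Illinois figure; with population weights the result moves toward Illinois, the much larger state.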

Averaging Trending Time Series
Observations Is Usually Not Informative
Note how the mean changes completely depending on what time interval is used to compute it.
Does the mean over the entire observation period mean anything? (Does it estimate anything meaningful?)

The Sample Median
• Median = the middle observation after data are sorted.
• Odd number: Central observation:
  Med[1,2,4,6,8,9,17] = 6
• Even number: Midpoint between the two central observations
  Med[1,2,4,6,8,9,14,17] = (6+8)/2 = 7
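The odd and even cases can be written out as a short function. This is a minimal sketch of the rule stated above; the function name `median` is illustrative, and Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle observation of the sorted data (midpoint of the two
    central observations when the count is even)."""
    if not values:
        raise ValueError("median of an empty sample is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single central observation.
        return ordered[mid]
    # Even count: midpoint between the two central observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 4, 6, 8, 9, 17]))
print(median([1, 2, 4, 6, 8, 9, 14, 17]))
```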

Sample Median of (Sorted) Defects Data
1.05 1.30 1.40 1.45 1.45 1.50
1.55 1.60 1.60 1.65 1.65 1.70
1.70 1.70 1.70 1.90 1.90 1.95
2.05 2.05 2.05 2.20 2.25 2.30
2.30 2.35 2.35 2.35 2.60 2.70
[Histogram of DEFECTS (Frequency vs. DEFECTS), marked with Median = 1.8000 and Mean = 1.8767]
(Let’s deduce estimates of the mean and median from the histogram.)
Tomorrow I will compute the average number of defectives for a 61st day. What is a good guess of the number I will find?

Skewed Earnings Distribution
Mean vs. Median in Skewed Data
Monthly Earnings: N = 595, Median = 800, Mean = 883
These data are skewed to the right.
The mean will exceed the median when the distribution is skewed to the right. (The skewness is in the direction of the long tail.)

Extreme Observations Distort
Means but Not Medians
• Outlying observations distort the mean
    • Med [1,2,4,6,8,9,17] = 6
      Mean[1,2,4,6,8,9,17] = 6.714
    • Med [1,2,4,6,8,9,17000] = 6 (still)
      Mean[1,2,4,6,8,9,17000] = 2432.9 (!)
• This typically occurs when there are some outlying observations, such as in cross sections of income or wealth, and/or when the sample is not very large.
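A quick check of the two examples, as a minimal sketch using Python's standard `statistics` module (the variable names are illustrative):

```python
import statistics

typical = [1, 2, 4, 6, 8, 9, 17]
with_outlier = [1, 2, 4, 6, 8, 9, 17000]  # last value replaced by an outlier

for label, data in [("typical", typical), ("with outlier", with_outlier)]:
    print(
        f"{label:>12}: median = {statistics.median(data)}, "
        f"mean = {statistics.mean(data):.3f}"
    )
```

The median depends only on the middle of the sorted data, so replacing the largest value leaves it unchanged, while the mean uses every value and moves with the outlier.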

The mean does not give information
about the shape of the distribution.
Two problems with the computations

(1) The data are ratings, not quantitative
(2) The mean does not suggest the
extreme nature of the data

The problem with the mean or median as a description of
a sample – more information is usually needed.
Both data sets have a mean of about 100.

Dispersion of the Observations
These are 30 hours of average defect data on sets of circuit boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
[Histogram of Defects: Frequency vs. Defects]
We quantify the variation of the values around the mean. Note the range is from 1.05 to 2.70. This gives an idea where the data lie. The mean plus a measure of the variation do the same job.

The Problem with the Range as a Measure of Dispersion
These two data sets both have 1,000 observations that range from about 10 to about 180.

A Measure of Dispersion
The standard deviation is the interesting value. You need
to compute the variance to get the standard deviation.
• Variance: $$s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$$
• Standard deviation: $$s_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$$
Note the units of measurement. The standard deviation has the same units
as the mean. The standard deviation is the standard measure for the
dispersion (spread) of a set of values (sample of observations).

The variance is the average squared deviation of the sample values from the mean. Why is N-1 in the denominator of $s^2$?
• Everyone else does it
• Minitab does it
• I have totally no idea.
• Tendency of the variance to be too small when computed using 1/N when the sample size, N, is itself small.
• (When N is large, it won’t matter.)
See HOG, p. 37
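The tendency of the 1/N version to come out too small can be seen in a simulation. This is a sketch under stated assumptions: the samples are drawn from a standard normal population (true variance 1.0), and the sample size of 5 and the number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)

N = 5                 # a deliberately small sample size
REPS = 20_000         # number of simulated samples
TRUE_VARIANCE = 1.0   # variance of the assumed standard normal population

divide_by_n = []          # variance computed with 1/N
divide_by_n_minus_1 = []  # variance computed with 1/(N-1)

for _ in range(REPS):
    sample = [random.gauss(0, 1) for _ in range(N)]
    divide_by_n.append(statistics.pvariance(sample))
    divide_by_n_minus_1.append(statistics.variance(sample))

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print(f"true variance            = {TRUE_VARIANCE}")
print(f"average using 1/N        = {avg_n:.3f}")
print(f"average using 1/(N - 1)  = {avg_n_minus_1:.3f}")
```

Averaged over many small samples, the 1/N estimate falls short of the true variance by roughly the factor (N-1)/N, while the 1/(N-1) estimate lands close to it.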
Computing a Standard Deviation
   Y    Deviation from Mean    Squared Deviation
   1          -2.1                   4.41
   4           0.9                   0.81
   6           2.9                   8.41
   0          -3.1                   9.61
   3          -0.1                   0.01
   2          -1.1                   1.21
   6           2.9                   8.41
   4           0.9                   0.81
   4           0.9                   0.81
   1          -2.1                   4.41
 SUM           0.0                  38.90

Sum = 31
Mean = 31/10 = 3.1
Sum of squared deviations = 38.90
Variance = 38.90/(10-1) = 4.322
Standard Deviation = 2.079
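The table's steps translate directly into code. This is a minimal sketch that follows the same columns; the variable names are illustrative.

```python
import math

y = [1, 4, 6, 0, 3, 2, 6, 4, 4, 1]

n = len(y)
mean = sum(y) / n                         # 31 / 10
deviations = [v - mean for v in y]        # column 2: deviation from mean
squared = [d ** 2 for d in deviations]    # column 3: squared deviation
sum_sq = sum(squared)                     # sum of squared deviations
variance = sum_sq / (n - 1)               # divide by N - 1
std_dev = math.sqrt(variance)

print(f"mean = {mean:.1f}")
print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"variance = {variance:.3f}")
print(f"standard deviation = {std_dev:.3f}")
```

Note that the deviations themselves always sum to zero (up to rounding), which is why they are squared before being averaged.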

Standard Deviation
These are 30 hours of average defect data on sets of circuit boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
$$\text{Variance} = \frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2 = \frac{1}{30-1}(4.808667) = 0.165816$$
$$\text{Standard Deviation} = \sqrt{\frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2} = 0.407205$$

Distribution of Values
[Histogram of Defects: Frequency vs. Defects]

Reliable Rules of Thumb
• Almost always, 66% of the observations in a sample will lie in the range
  [mean - 1 s.d. to mean + 1 s.d.]
• Almost always, 95% of the observations in a sample will lie in the range
  [mean - 2 s.d. to mean + 2 s.d.]
• Almost always, 99.5% of the observations in a sample will lie in the range
  [mean - 3 s.d. to mean + 3 s.d.]
When these rules are not met, they will almost be met. Data nearly always act this way.

A Reliable Empirical Rule
[Dotplot of Defects, axis from 1.00 to 2.75]
Mean ± 2 s = 1.8767 ± 2(.4072) = 1.06 to 2.69 includes 28/30 = 93%
Mean ± 1 s = (1.47 to 2.28) includes 18/30 = 60%
Minitab: Graph → Dotplot …
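The counts behind these percentages can be reproduced directly from the defects data. This is a minimal sketch; `share_within` is a hypothetical helper name, not a Minitab or library function.

```python
import statistics

defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

mean = statistics.fmean(defects)
s = statistics.stdev(defects)  # sample standard deviation (N - 1 divisor)


def share_within(k):
    """Count and share of observations in [mean - k*s, mean + k*s]."""
    low, high = mean - k * s, mean + k * s
    count = sum(low <= v <= high for v in defects)
    return count, count / len(defects)


for k in (1, 2, 3):
    count, share = share_within(k)
    print(f"mean ± {k} s: {count}/{len(defects)} = {share:.0%}")
```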

Rules For Transformations
• Mean of $a + bY$ = $a + b\,\bar{y}$
• Standard deviation of $a + bY$ = $|b|\,s_y$

Which city is warmer, New York (USA) or Old
York (England)? Which is more variable?
Average Temperatures (high + low)/2
Month   NY (°F)   OY (°C)     Month   NY (°F)   OY (°C)
Jan      29.5       2.0       Jul      75.5      15.5
Feb      32.0       2.0       Aug      73.5      15.0
Mar      35.0       4.5       Sep      66.0      13.0
Apr      50.0       8.5       Oct      55.0       9.5
May      60.5       9.5       Nov      45.0       6.0
Jun      70.0      13.0       Dec      35.0       3.5

City        Mean     Std.Dev.   Min      Max
Old York    8.500    4.913      2.000    15.50
New York    52.25    16.93      29.50    75.50
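To compare the two cities, the Old York figures have to be put in the same units as New York's. The sketch below applies the transformation rules with a = 32 and b = 1.8 (the standard Celsius to Fahrenheit conversion, F = 32 + 1.8 C) and cross-checks them by converting every month; the variable names are illustrative.

```python
import statistics

# Monthly averages, January through December, from the table.
new_york_f = [29.5, 32.0, 35.0, 50.0, 60.5, 70.0, 75.5, 73.5, 66.0, 55.0, 45.0, 35.0]
old_york_c = [2.0, 2.0, 4.5, 8.5, 9.5, 13.0, 15.5, 15.0, 13.0, 9.5, 6.0, 3.5]

ny_mean = statistics.fmean(new_york_f)
ny_sd = statistics.stdev(new_york_f)
oy_mean_c = statistics.fmean(old_york_c)
oy_sd_c = statistics.stdev(old_york_c)

# Transformation rules: mean of a + bY is a + b*mean, s.d. is |b| * s.d.
a, b = 32.0, 1.8
oy_mean_f = a + b * oy_mean_c
oy_sd_f = abs(b) * oy_sd_c

# Cross-check: convert each month first, then summarize.
old_york_f = [a + b * c for c in old_york_c]

print(f"New York: mean = {ny_mean:.2f} F, s.d. = {ny_sd:.2f} F")
print(f"Old York: mean = {oy_mean_f:.2f} F, s.d. = {oy_sd_f:.2f} F")
```

Once both series are in Fahrenheit, New York has both the higher mean and the larger standard deviation in this table.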

Application – Cost of Defects
These are 30 observations of average defect data on sets of
manufactured circuit boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
Suppose the cost to repair defects is $25 + 10*Defects

I.e., a $25 setup cost plus $10 per defect.
Mean defects = 1.8767 Standard Deviation = 0.407205
Mean Cost = $25 + $10(1.8767) = $43.767
Standard Deviation Cost = $10(.407205) = $4.07205

Correlation
• Variables Y and X vary together
• Causality vs. correlation: Does movement in X “cause” movement in Y in some metaphysical sense?
• Correlation
    • Simultaneous movement through a statistical relationship
    • Simultaneous variation “induced” by the variation of a common third effect

Samples of House Listings and
Per Capita Incomes at a Particular Time

Scatter Plot Suggests Positive Correlation
[Scatterplot of Listing vs IncomePC]

Regression Measures Correlation
Regression Line: Listing = a + b IncomePC
[Scatterplot of Listing vs IncomePC with the fitted regression line]

Correlation Is Not Causation
Price and Income seem to be “positively” related.
[Scatterplot of Income vs GasPrice]
The U.S. Gasoline Market. Data are yearly from 1953 to 2004. Plot of per capita income vs. gasoline price index.

The Hidden (Spurious) Relationship
Not positively “related” to each other; both positively related to “time.”
[Scatterplot of Income vs Year] [Scatterplot of GasPrice vs Year]
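A simulation can show how a shared time trend alone produces a strong correlation. This is a sketch on synthetic data, not the actual gasoline market series: the trend slopes, noise levels, and the helper `pearson` are assumptions chosen for illustration. Only the yearly 1953 to 2004 span is taken from the slide.

```python
import random
import statistics


def pearson(x, y):
    """Sample correlation coefficient of two equal-length series."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


random.seed(2)
years = list(range(1953, 2005))  # yearly observations, 1953 to 2004

# Independent noise terms: by construction, unrelated to each other.
income_noise = [random.gauss(0, 500) for _ in years]
gas_noise = [random.gauss(0, 5) for _ in years]

# Each synthetic series is an assumed linear time trend plus its own noise.
income = [10_000 + 300 * (t - 1953) + e for t, e in zip(years, income_noise)]
gas_price = [20 + 1.8 * (t - 1953) + e for t, e in zip(years, gas_noise)]

r_levels = pearson(income, gas_price)       # driven by the common trend
r_noise = pearson(income_noise, gas_noise)  # what remains without the trend

print(f"correlation of the two trending series = {r_levels:+.3f}")
print(f"correlation of their noise components  = {r_noise:+.3f}")
```

In this construction the two series are linked only through time, yet their levels are strongly correlated; once the trend is taken out, little correlation remains.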

Correlation is the interesting number.
We must compute covariance and the two
standard deviations first.
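Standard deviations: $$s_X = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}, \qquad s_Y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$$
Covariance: $$s_{XY} = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{N-1}$$
Correlation: $$r_{XY} = \frac{s_{XY}}{s_X\, s_Y}, \qquad -1 \le r_{XY} \le +1$$
Units free. A pure number.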
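The three steps (two standard deviations, the covariance, then their ratio) can be sketched as follows. The function names are illustrative, and the example pairs the defects data with the repair cost function $25 + 10*Defects from the cost application earlier in the deck, which is an exact linear relationship.

```python
import math


def sample_std(values):
    """Sample standard deviation with the N - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_cov(x, y):
    """Sample covariance with the N - 1 divisor."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return sample_cov(x, y) / (sample_std(x) * sample_std(y))


defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]
cost = [25 + 10 * d for d in defects]  # exact linear function of defects

print(f"covariance  = {sample_cov(defects, cost):.4f}")
print(f"correlation = {correlation(defects, cost):+.4f}")
```

Because the units of the covariance are divided out by the two standard deviations, rescaling either variable (dollars to cents, say) leaves the correlation unchanged.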

Correlation
[Scatterplot of Listing vs IncomePC]
$r_{\text{Income},\text{Listing}} = +0.591$

Correlations
[Scatterplot of Noise vs Defects] r = 0.0
[Scatterplot of cost vs Defects] r = +1.0
[Scatterplot of Noise vs MoreNoise] r = +0.5

Sample Statistics and Population Parameters
• The sample has a sample mean, $\bar{Y}$, and a sample standard deviation, $s_Y$.
• The population has a mean, μ, and a standard deviation, σ.
• The sample “looks like” the population.
• The sample statistics resemble the population features.
• The bigger the RANDOM sample, the closer the resemblance will be. We will study this later in the course.

Summary
• Statistics to describe location (mean) and spread (standard deviation) of a sample of values.
    • Interpretations
    • Computations
    • Complications
• Statistics and graphical tools to describe bivariate (two variable) relationships
    • Scatter plots
    • Correlation
