Statistics and Data Analysis: Professor William Greene Stern School of Business IOMS Department Department of Economics


Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
1/54 2: Descriptive Statistics
Statistics and Data Analysis

Part 2 – Descriptive Statistics


Summarizing data with useful
statistics



Use random samples and basic descriptive statistics.
What is the ‘breach rate’ in a pool of tens of thousands of mortgages?
(‘Breach’ = improperly underwritten or serviced or otherwise faulty mortgage.)



The forensic analysis was an examination of
statistics from a random sample of 1,500 loans.



Descriptive Statistics
Agenda
 Populations and Random Samples
 Descriptive Statistics for a Variable
 Measures of location: Mean, median, mode
 Measure of dispersion: Standard deviation
 Measuring Correlation of Two Variables
 Understanding correlation
 Measuring correlation
 Scatter plots and regression



Populations and Samples
 Population: Collection of all possible observations (data
points) on a variable
 Sample: A subset of the data points in the population
 Random sample: Defined by the way the sample data are
obtained. All points in the population are equally likely to
be drawn in any particular sample.
 What is the purpose of obtaining a sample?
To describe or learn about the population.
 The sample is observed
 The population is assumed.
 In order to learn confidently about the population from
a sample, the sample must be ‘random.’
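
A minimal sketch of drawing a random sample in the sense defined above, where every member of the population is equally likely to be chosen. The population of 10,000 loan ID numbers and the seed are made up for illustration.

```python
import random

# Hypothetical population: ID numbers for 10,000 loans (made up for illustration).
population = list(range(1, 10_001))

rng = random.Random(0)                  # fixed seed so the draw can be repeated
sample = rng.sample(population, 100)    # 100 distinct loans, each equally likely

print(len(sample), "loans drawn;", len(set(sample)), "distinct")
```

Drawing without replacement this way gives no loan a better chance of selection than any other, which is what separates a random sample from a convenience sample.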



Random Sampling
 A production process produces circuit boards. Boards are
produced in each hour with an average of 2 defects per board
when the process is in control. Each hour, the engineer
examines a random sample of 100 circuit boards. The average
number of defects per board in each hour of a particular
30-hour week is

Hour 1: Mean of 100 boards = 1.95,
Hour 2: Mean of 100 boards = 2.65,
Hour 3: Mean of 100 boards = 1.80, …
Hour 30: Mean of 100 boards = 2.35.
(These are estimates of the defect rate per board.)

 The objective of drawing the sample is to determine
whether the process is in control or not. (The process is
under control if the defect rate is < 2.)
 Method: Assuming the process is in control, would we
expect to see this rate of defects?



Random samples of behavior are difficult
to obtain, especially by telephone.



Nonrandom Samples

Nonrandom samples produce tainted, sometimes unbelievable results
 Biased with respect to the population
 May describe a specific subset of the population
that is not useful.



(Non)Randomness of Samples
Sources of bias in samples (generally related)
 Bad sample design – e.g., home phone
surveys conducted during working hours
 Survey (non)response bias – e.g., opinion
surveys about service quality
 Participation bias – e.g., voluntary
participation in a survey
 Self selection – volunteering for a trial or an
opinion sample. (Shere Hite’s cultural
revolution)
 Attrition bias from clinical trials - e.g., if the
drug works, the subject does not come back.



Nonrandom results in incubator funds.

The “NYU No Action Letter”



Nonscientific, Nonrandom “(non)Sampling”

A Cultural Revolution …
“3000 women, ages 14 to 78 describe in their own words …”



http://www.amazon.com/The-Hite-Report-National-Sexuality/dp/1583225692

http://en.wikipedia.org/wiki/Shere_Hite



The Lesson…

Having a really big sample does not assure you of an accurate
result. It may assure you of a really solid, really bad
(inaccurate) result.
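
The lesson can be illustrated with a small simulation. This is only a sketch on made-up data: the population of 100,000 scores, the rule that only people scoring above 60 volunteer, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Made-up population: 100,000 scores on a 0-100 scale.
population = [min(100.0, max(0.0, rng.gauss(50, 15))) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Small random sample: every member is equally likely to be drawn.
random_sample = rng.sample(population, 200)

# Huge self-selected sample: only people with a score above 60 respond.
volunteers = [x for x in population if x > 60]

print(f"population mean          {true_mean:.1f}")
print(f"random sample, n = {len(random_sample):5d}  {statistics.mean(random_sample):.1f}")
print(f"volunteers,    n = {len(volunteers):5d}  {statistics.mean(volunteers):.1f}")
```

The volunteer sample is far larger than the random one, yet its mean sits well above the population mean, while the small random sample lands close to it.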



How do ASCAP, BMI and SESAC allocate the royalty pool to specific authors and
publishers? The following relates to terrestrial radio, which, as a group, pays a lump sum
into the pool, which is then allocated by the PRSs.

http://old.cni.org/docs/ima.ip-workshop/Massarsky.html
A Descriptive Statistic
 Is … ?
 Describes what?
 The sample data
 The population that the data came from



Measures of Location
These are 30 hours of average defect data on sets of
circuit boards. Roughly what is the typical value?
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

 Location and central tendency


 There exists a distribution of values
 We are interested in the “center” of the distribution
 Two measures are the sample mean and the sample
median
 They look similar, and measure the same thing.
 They differ systematically (and predictably) when the data
are not ‘symmetric.’



The Sample Mean
These are 30 hours of average defect data on sets of circuit
boards. Roughly what is the typical value?
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

There are N observations (data points) in the sample.


Sample data: $y = [y_1, y_2, y_3, y_4, \ldots, y_N]$
In this sample, N = 30. The sample mean is

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\,[y_1 + y_2 + y_3 + y_4 + \cdots + y_N] = \frac{1}{30}(1.45 + \cdots + 2.35) = \frac{56.30}{30} = 1.8767$$
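
As a check on the arithmetic, the sample mean of the 30 defect values can be computed directly. A minimal sketch; the variable names are illustrative.

```python
# The 30 hourly averages listed on the slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

n = len(defects)       # number of observations, N
total = sum(defects)   # y_1 + y_2 + ... + y_N
mean = total / n       # the sample mean

print(f"N = {n}, sum = {total:.2f}, mean = {mean:.4f}")
```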



It may be necessary to ‘weight’ aggregate data.
Average Home Listings

$$\overline{\text{Listing}} = \frac{1}{51}\,(896{,}800 + 713{,}864 + \cdots + 164{,}326) = 369{,}687$$



Averaging Averages?
 Hawaii’s average listing = $896,800
 Hawaii’s population = 1,275,194
 Illinois’ average listing = $377,683
 Illinois’ population = 12,763,371
 Illinois and Hawaii each get weight 1/51 = .019607
when the mean is computed.
 Looks like Hawaii is getting too much
influence.



A Properly Weighted Average
Simple average = $\overline{\text{Listing}} = \sum_{\text{States}} \text{Weight}_{\text{State}} \times \text{Listing}_{\text{State}}$, with $\text{Weight}_{\text{State}} = \frac{1}{51} = .019607$ for every state.

Illinois is 10 times as big as Hawaii. Suppose we use weights that are
in proportion to the state's population. (The weights sum to 1.0.)
$\text{Weight}_{\text{State}}$ varies from .001717 for Wyoming to .121899 for California.

New average is 409,234 compared to 369,687 without weights, an
error of 11%. Sometimes an unequal weighting of the
observations is necessary.
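
A minimal sketch of the two kinds of average, using only the Hawaii and Illinois figures quoted earlier. It is a two-state illustration, not the 51-state computation, so it does not reproduce the 409,234 and 369,687 figures.

```python
# Figures quoted on the slides for two of the 51 entries.
listing = {"Hawaii": 896_800, "Illinois": 377_683}
population = {"Hawaii": 1_275_194, "Illinois": 12_763_371}

# Equal weights: each state counts the same.
simple = sum(listing.values()) / len(listing)

# Population weights: each state's share of the combined population (sums to 1).
total_pop = sum(population.values())
weights = {state: population[state] / total_pop for state in listing}
weighted = sum(weights[state] * listing[state] for state in listing)

print(f"simple average    {simple:,.0f}")
print(f"weighted average  {weighted:,.0f}")
```

With only these two states, the population weights pull the average toward Illinois, the larger state. The direction and size of the change in the full 51-state average depend on all of the states.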

State populations from http://www.factmonster.com/ipka/A0004986.html



Averaging Trending Time Series
Observations Is Usually Not Informative
Note how the mean changes completely depending
on what time interval is used to compute it.

Does the mean over the entire observation period mean anything?
(Does it estimate anything meaningful?)



The Sample Median
 Median = the middle observation after
data are sorted.
 Odd number: Central observation:
Med[1,2,4,6,8,9,17] = 6
 Even number: Midpoint between the
two central observations
Med[1,2,4,6,8,9,14,17] = (6+8)/2=7
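
The two cases above can be written out as a short function. A minimal sketch; the function name `median` is just illustrative.

```python
def median(values):
    """Middle value of the sorted data; midpoint of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                # odd count: the central observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: midpoint of the central pair

print(median([1, 2, 4, 6, 8, 9, 17]))       # odd number of observations
print(median([1, 2, 4, 6, 8, 9, 14, 17]))   # even number of observations
```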



Sample Median of (Sorted) Defects Data
1.05 1.30 1.40 1.45 1.45 1.50
1.55 1.60 1.60 1.65 1.65 1.70
1.70 1.70 1.70 1.90 1.90 1.95
2.05 2.05 2.05 2.20 2.25 2.30
2.30 2.35 2.35 2.35 2.60 2.70

Median = 1.8000
Mean = 1.8767

[Histogram of DEFECTS: Frequency (0 to 12) against DEFECTS (1.000 to 3.000)]



(Let’s deduce estimates of the mean and median from the histogram.)

Tomorrow I will compute the average number of defectives
for a 61st day. What is a good guess of the number I will find?



Skewed Earnings Distribution
Mean vs. Median in Skewed Data
Monthly Earnings
N = 595, Median = 800, Mean = 883
These data are skewed to the right.

The mean will exceed the median when the distribution is skewed
to the right. (The skewness is in the direction of the long tail.)



Extreme Observations Distort
Means but Not Medians
 Outlying observations distort the mean
 Med [1,2,4,6,8,9,17] = 6
Mean[1,2,4,6,8,9,17] = 6.714
 Med [1,2,4,6,8,9,17000] = 6 (still)
Mean[1,2,4,6,8,9,17000] = 2432.9 (!)
 This typically occurs when there are some outlying
observations, such as in cross sections of income or
wealth and/or when the sample is not very large.
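
The effect of the single extreme value can be checked with Python's standard `statistics` module. A minimal sketch using the two small data sets above.

```python
import statistics

data = [1, 2, 4, 6, 8, 9, 17]
with_outlier = [1, 2, 4, 6, 8, 9, 17000]   # same data, largest value replaced

for label, values in [("original", data), ("with outlier", with_outlier)]:
    med = statistics.median(values)
    avg = statistics.mean(values)
    print(f"{label:12s}  median = {med}  mean = {avg:.3f}")
```

Only the largest observation changed, so the middle value of the sorted data stays where it was, while the sum (and therefore the mean) is dominated by the outlier.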



The mean does not give information
about the shape of the distribution.

Two problems with the computations
(1) The data are ratings, not quantitative
(2) The mean does not suggest the extreme nature of the data



The problem with the mean or median as a description of
a sample – more information is usually needed.

Both data sets have a mean of about 100.



Dispersion of the Observations
These are 30 hours of average defect data on sets of circuit
boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

We quantify the variation of the values around the mean. Note the
range is from 1.05 to 2.70. This gives an idea where the data lie.
The mean plus a measure of the variation do the same job.

[Histogram of Defects: Frequency against Defects (1.2 to 2.8)]



The Problem with the Range as a Measure of Dispersion

These two data sets both have 1,000 observations
that range from about 10 to about 180



A Measure of Dispersion
The standard deviation is the interesting value. You need
to compute the variance to get the standard deviation.

 Variance = $s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$

 Standard deviation = $s_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$

Note the units of measurement. The standard deviation has the same units
as the mean. The standard deviation is the standard measure for the
dispersion (spread) of a set of values (sample of observations).



The variance is the average squared deviation of the
sample values from the mean. Why is N-1 in the
denominator of $s^2$?

 Everyone else does it
 Minitab does it
 I have totally no idea.
 Tendency of the variance to be too
small when computed using 1/N when
the sample size, N, is itself small.
 (When N is large, it won’t matter.)
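
The last two points can be illustrated by simulation. This is a sketch under stated assumptions: samples of size 5 are drawn from a made-up normal population with true variance 1, and the sample size, repetition count, and seed are arbitrary choices.

```python
import random

rng = random.Random(42)
n, reps = 5, 20_000          # small samples, many repetitions

using_n, using_n_minus_1 = [], []
for _ in range(reps):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # true variance is 1
    mean = sum(sample) / n
    sum_sq = sum((y - mean) ** 2 for y in sample)      # sum of squared deviations
    using_n.append(sum_sq / n)                         # divide by N
    using_n_minus_1.append(sum_sq / (n - 1))           # divide by N - 1

avg_n = sum(using_n) / reps
avg_n1 = sum(using_n_minus_1) / reps
print(f"average of the 1/N estimates:     {avg_n:.3f}")
print(f"average of the 1/(N-1) estimates: {avg_n1:.3f}")
```

Across many small samples, the 1/N version averages about (N-1)/N times the true variance, which is 0.8 here, while the 1/(N-1) version averages close to 1. As N grows, (N-1)/N approaches 1 and the difference disappears.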

See HOG, p. 37
Computing a Standard Deviation
  Y    Deviation    Squared
       From Mean    Deviation
  1      -2.1         4.41
  4       0.9         0.81
  6       2.9         8.41
  0      -3.1         9.61
  3      -0.1         0.01
  2      -1.1         1.21
  6       2.9         8.41
  4       0.9         0.81
  4       0.9         0.81
  1      -2.1         4.41
 SUM      0.0        38.90

Sum = 31
Mean = 31/10 = 3.1
Sum of squared deviations = 38.90
Variance = 38.90/(10-1) = 4.322
Standard Deviation = 2.079
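
The columns of the table can be reproduced step by step. A minimal sketch; the variable names are illustrative.

```python
import math

y = [1, 4, 6, 0, 3, 2, 6, 4, 4, 1]

n = len(y)
mean = sum(y) / n                          # 31 / 10
deviations = [v - mean for v in y]         # "Deviation From Mean" column
squared = [d ** 2 for d in deviations]     # "Squared Deviation" column
sum_sq = sum(squared)
variance = sum_sq / (n - 1)                # N - 1 in the denominator
std_dev = math.sqrt(variance)

for v, d, s in zip(y, deviations, squared):
    print(f"{v:3d} {d:6.1f} {s:6.2f}")
print(f"mean = {mean:.1f}, sum of squares = {sum_sq:.2f}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```

The deviations from the mean always sum to zero (up to rounding), which is why they are squared before being averaged.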



Standard Deviation
These are 30 hours of average defect data on sets of circuit
boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

Variance = $\frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2 = \frac{1}{30-1}\,(4.808667) = 0.165816$

Standard Deviation = $\sqrt{\frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2} = 0.407205$



Distribution of Values
Histogram of Defects

4
Frequency

0
1.2 1.6 2.0 2.4 2.8
Defects



Reliable Rules of Thumb
 Almost always, about 68% of the observations in a sample will
lie in the range
[mean - 1 s.d. to mean + 1 s.d.]
 Almost always, about 95% of the observations in a sample will
lie in the range
[mean - 2 s.d. to mean + 2 s.d.]
 Almost always, about 99.7% of the observations in a sample will
lie in the range
[mean - 3 s.d. to mean + 3 s.d.]
When these rules are not met, they will almost be met. Data
nearly always act this way. (The percentages are those of a
bell-shaped, normal distribution, so strongly skewed data can
depart from them.)



A Reliable Empirical Rule
Dotplot of Defects
[Dotplot: Defects axis from 1.00 to 2.75]

Mean ± 2 s = 1.8767 ± 2(.4072) = 1.06 to 2.69 includes 28/30 = 93%
Mean ± 1 s = (1.47 to 2.28) includes 18/30 = 60%

Minitab: Graph → Dotplot …
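
The counts quoted for the defects data can be verified by computing the standard deviation and counting the observations inside each band. A minimal sketch; `share_within` is an illustrative helper name.

```python
import statistics

defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

mean = statistics.mean(defects)
s = statistics.stdev(defects)      # sample standard deviation, N - 1 in the denominator

def share_within(k):
    """Count and share of observations in [mean - k*s, mean + k*s]."""
    low, high = mean - k * s, mean + k * s
    count = sum(low <= y <= high for y in defects)
    return count, count / len(defects)

print(f"mean = {mean:.4f}, s = {s:.6f}")
for k in (1, 2, 3):
    count, share = share_within(k)
    print(f"within {k} s.d.: {count}/{len(defects)} = {share:.0%}")
```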



Rules For Transformations
 Mean of a + bY = $a + b\,\bar{y}$

 Standard deviation of a + bY = $|b|\,s_y$
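
Both rules follow directly from the definitions of the sample mean and the sample variance used earlier in these slides. A short derivation, written with the same symbols:

```latex
\begin{align*}
\overline{a+bY} &= \frac{1}{N}\sum_{i=1}^{N}(a + bY_i) = a + b\,\bar{y},\\[4pt]
s_{a+bY}^{2} &= \frac{1}{N-1}\sum_{i=1}^{N}\bigl[(a + bY_i) - (a + b\,\bar{y})\bigr]^{2}
              = b^{2}\,\frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{y})^{2} = b^{2}s_y^{2},\\[4pt]
s_{a+bY} &= \sqrt{b^{2}s_y^{2}} = |b|\,s_y .
\end{align*}
```

The constant a shifts every observation and the mean by the same amount, so it cancels out of each deviation. Only the scale factor b affects the spread, and the absolute value appears because a standard deviation cannot be negative.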



Which city is warmer, New York (USA) or Old
York (England)? Which is more variable?
Average Temperatures (high + low)/2
Month   NY (°F)   OY (°C)      Month   NY (°F)   OY (°C)
Jan      29.5       2.0        Jul      75.5      15.5
Feb      32.0       2.0        Aug      73.5      15.0
Mar      35.0       4.5        Sep      66.0      13.0
Apr      50.0       8.5        Oct      55.0       9.5
May      60.5       9.5        Nov      45.0       6.0
Jun      70.0      13.0        Dec      35.0       3.5

City        Mean     Std.Dev.   Min      Max
Old York     8.500    4.913      2.000   15.50
New York    52.25    16.93      29.50    75.50
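
The two cities are measured in different units, so the comparison needs the transformation rules: Celsius = (Fahrenheit - 32) × 5/9, which is a + bY with a = -160/9 and b = 5/9. A minimal sketch using the monthly values in the table; the variable names are illustrative.

```python
import statistics

# Monthly averages from the table, January through December.
ny_f = [29.5, 32.0, 35.0, 50.0, 60.5, 70.0, 75.5, 73.5, 66.0, 55.0, 45.0, 35.0]
oy_c = [2.0, 2.0, 4.5, 8.5, 9.5, 13.0, 15.5, 15.0, 13.0, 9.5, 6.0, 3.5]

# Celsius = a + b * Fahrenheit
a, b = -160 / 9, 5 / 9

ny_mean_c = a + b * statistics.mean(ny_f)     # rule for the mean
ny_sd_c = abs(b) * statistics.stdev(ny_f)     # rule for the standard deviation

# Cross-check: convert every month, then recompute from scratch.
ny_c = [a + b * f for f in ny_f]

print(f"New York: mean = {ny_mean_c:.2f} C, s.d. = {ny_sd_c:.2f} C")
print(f"Old York: mean = {statistics.mean(oy_c):.2f} C, s.d. = {statistics.stdev(oy_c):.2f} C")
```

On a common Celsius scale, New York has both the higher mean and the larger standard deviation, so on these figures it is the warmer and the more variable of the two.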



Application – Cost of Defects
These are 30 observations of average defect data on sets of
manufactured circuit boards.
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

Suppose the cost to repair defects is $25 + 10*Defects,
i.e., a $25 setup cost plus $10 per defect.
Mean defects = 1.8767 Standard Deviation = 0.407205
Mean Cost = $25 + $10(1.8767) = $43.767
Standard Deviation Cost = $10(.407205) = $4.07205



Correlation
 Variables Y and X vary together
 Causality vs. correlation: Does movement in X
“cause” movement in Y in some metaphysical
sense?
 Correlation
 Simultaneous movement through a statistical relationship
 Simultaneous variation “induced” by the variation of a
common third effect



Samples of House Listings and
Per Capita Incomes at a Particular Time



Scatter Plot Suggests Positive Correlation
[Scatterplot of Listing vs IncomePC: Listing (100,000 to 900,000) against IncomePC (15,000 to 32,500)]



Regression Measures Correlation

[Scatterplot of Listing vs IncomePC with fitted line: Listing (100,000 to 900,000) against IncomePC (15,000 to 32,500)]
Regression Line: Listing = a + b IncomePC



Correlation Is Not Causation
Price and Income seem to be “positively” related.

[Scatterplot of Income vs GasPrice: Income (10,000 to 27,500) against GasPrice (20 to 120)]
The U.S. Gasoline Market. Data are yearly from 1953 to 2004. Plot of
per capita income vs. gasoline price index.



The Hidden (Spurious) Relationship
Not positively “related” to each other; both positively related to “time.”

[Two panels: Scatterplot of Income vs Year; Scatterplot of GasPrice vs Year. Year runs from 1950 to 2010.]
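
The same effect can be reproduced with simulated data. This sketch does not use the gasoline data: it builds two made-up yearly series that share nothing except an upward time trend, with slopes, noise level, and seed chosen arbitrarily.

```python
import math
import random

def correlation(x, y):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

rng = random.Random(7)
years = list(range(1953, 2005))

# Independent noise for each series: by construction the two are unrelated.
noise_a = [rng.gauss(0, 1) for _ in years]
noise_b = [rng.gauss(0, 1) for _ in years]

# Each made-up series is a time trend plus its own noise.
series_a = [0.5 * (t - 1953) + e for t, e in zip(years, noise_a)]
series_b = [2.0 * (t - 1953) + e for t, e in zip(years, noise_b)]

print(f"with the common trend:  r = {correlation(series_a, series_b):+.3f}")
print(f"trend taken out:        r = {correlation(noise_a, noise_b):+.3f}")
```

The two series are strongly correlated only because both rise with time. Once the trend is taken out (here by looking at the noise terms alone), little correlation remains.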



Correlation is the interesting number.
We must compute covariance and the two
standard deviations first.

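Standard Deviations: $s_X = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}$, $s_Y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$

Covariance: $s_{XY} = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{N-1}$

Correlation: $r_{XY} = \frac{s_{XY}}{s_X\,s_Y}$, $-1 \le r_{XY} \le +1$. Units free. A pure number.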
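
The three steps can be sketched in code. As an illustration, the sketch uses the monthly New York and Old York temperatures from the earlier slide, which is a choice made here and not a computation from the slides. The function names are illustrative.

```python
import math

# Monthly averages, January through December (New York in F, Old York in C).
ny_f = [29.5, 32.0, 35.0, 50.0, 60.5, 70.0, 75.5, 73.5, 66.0, 55.0, 45.0, 35.0]
oy_c = [2.0, 2.0, 4.5, 8.5, 9.5, 13.0, 15.5, 15.0, 13.0, 9.5, 6.0, 3.5]

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def covariance(x, y):
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    return covariance(x, y) / (std_dev(x) * std_dev(y))

r = correlation(ny_f, oy_c)

# Units free: converting New York to Celsius leaves r unchanged.
ny_c = [(f - 32) * 5 / 9 for f in ny_f]
r_celsius = correlation(ny_c, oy_c)

print(f"s_X = {std_dev(ny_f):.2f}, s_Y = {std_dev(oy_c):.3f}, s_XY = {covariance(ny_f, oy_c):.2f}")
print(f"r = {r:.3f}, r after converting to Celsius = {r_celsius:.3f}")
```

The covariance depends on the units of both variables. Dividing by the two standard deviations removes the units, which is why the correlation is the same whether New York is measured in Fahrenheit or Celsius.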



Correlation
[Scatterplot of Listing vs IncomePC: Listing (100,000 to 900,000) against IncomePC (15,000 to 32,500)]

$r_{\text{Income},\text{Listing}} = +0.591$



Correlations
[Scatterplot of Noise vs Defects] r = 0.0
[Scatterplot of cost vs Defects] r = +1.0
[Scatterplot of Noise vs MoreNoise] r = +0.5



Sample Statistics and Population Parameters
 Sample has a sample mean and standard
deviation, $\bar{Y}$ and $s_Y$.
 Population has a mean, μ, and standard
deviation, σ.
 The sample “looks like” the population.
 The sample statistics resemble the population
features.
 The bigger is the RANDOM sample, the
closer will be the resemblance. We will study
this later in the course.



Summary
 Statistics to describe location (mean) and
spread (standard deviation) of a sample of
values.
 Interpretations
 Computations
 Complications
 Statistics and graphical tools to describe
bivariate (two variable) relationships
 Scatter plots
 Correlation

