Ec310 Day 3 Lecture Notes
Statistics – Measurement in Economics
Day 3 – Video 1
Numerical Descriptive Techniques
Measures of Linear Relationship
• We want a way to assess whether there’s a linear relationship between two variables
• We’ll cover two today, and then a third in the final weeks of the class
• The sample covariance between x and y is:

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where $\bar{x}$ is the sample mean of x and $\bar{y}$ is the sample mean of y.

• Note that the sample covariance between x and x is:

$$s_{xx} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{n - 1} = s_x^2$$

This should look familiar!

• In the same way as there was a “shortcut” for the sample variance, there’s also a shortcut for the sample covariance:

$$s_{xy} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$$

• (Once again, when y=x this simplifies to the same equation we had before for the variance)
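As a quick check that the definition and the shortcut agree, here is a minimal Python sketch. The function names `cov_definition` and `cov_shortcut` are made up for illustration; the numbers are the x and y values of Data Set #1 from these notes.

```python
def cov_definition(x, y):
    """Sample covariance from deviations about the sample means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def cov_shortcut(x, y):
    """Sample covariance from the shortcut formula (no deviations needed)."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


x = [2, 6, 7]
y = [13, 20, 27]

s_xy_def = cov_definition(x, y)
s_xy_short = cov_shortcut(x, y)
s_xx = cov_definition(x, x)  # covariance of x with itself is the variance of x

print(s_xy_def, s_xy_short, s_xx)
```

Both functions should return the same covariance, and passing x twice returns the sample variance of x.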
Option 1: Covariance
• Consider the following three sets of data:

              n    x    y
Data Set #1   1    2   13
              2    6   20
              3    7   27
Data Set #2   1    2   27
              2    6   20
              3    7   13
Data Set #3   1    2   20
              2    6   27
              3    7   13

• In each set, the values of X and Y are the same; the only difference is the order of the Y’s
• The sample means are therefore the same in every set:

$$\bar{x} = \frac{2 + 6 + 7}{3} = 5 \qquad \bar{y} = \frac{13 + 20 + 27}{3} = 20$$

• Let’s see how this affects the variance and covariance calculations… For each set, fill in the columns (x − x̄), (x − x̄)², (y − ȳ), (y − ȳ)² and (x − x̄)(y − ȳ), then compute s_x², s_y² and s_xy

[Figure: scatterplots of y against x for each data set, with quadrants marked + and − around the point (x̄, ȳ). Data Set #1: s_xy = 17.5; Data Set #2: s_xy = −17.5; Data Set #3: s_xy = −3.5]
• r = −1: Perfect negative linear relationship
[Figure: Which of these has a higher r?]
Option 2: Correlation Coefficient
• So the correlation coefficient has one major advantage over the covariance:
  • It’s a unit-free measure
  • As a result, it ranges from -1 to +1 with a straightforward interpretation
  • So it yields a quick answer to the question: How strong is the association between x & y?
• But as we’ve seen, we must keep a few caveats in mind:
  • It is only a measure of the linear association
  • It measures the extent to which the variables are related, not the slope
  • When we want a measure of the slope, we’ll turn to regression analysis (Keller introduces regression in Chapter 4, but we’ll wait and tackle it in more depth near the end of the term)
• Example: Let’s return to our tool shop example
  • We saw earlier that the units had a huge impact on the covariance
  • What about the correlation?
  • Correlation between electrical costs (in dollars) and tools produced is 0.87
  • Correlation between electrical costs (in cents) and tools produced is 0.87
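The unit-free property can be checked numerically. The sketch below uses made-up numbers (they are NOT the tool shop data from the lecture), and the helper names `sample_cov` and `sample_corr` are assumptions for illustration. Converting costs from dollars to cents multiplies the covariance by 100 but should leave the correlation unchanged.

```python
import math


def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_corr(x, y):
    """Correlation coefficient r = s_xy / (s_x * s_y)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))


# Hypothetical data for illustration only, not the lecture's tool shop data
tools = [7, 3, 2, 5, 8, 11]
cost_dollars = [23.80, 11.89, 15.98, 26.11, 31.79, 39.93]
cost_cents = [100 * c for c in cost_dollars]

cov_dollars = sample_cov(tools, cost_dollars)
cov_cents = sample_cov(tools, cost_cents)
r_dollars = sample_corr(tools, cost_dollars)
r_cents = sample_corr(tools, cost_cents)

print(f"covariance:  dollars = {cov_dollars:.2f}, cents = {cov_cents:.2f}")
print(f"correlation: dollars = {r_dollars:.4f}, cents = {r_cents:.4f}")
```

The rescaling cancels because the factor of 100 appears once in the numerator (through s_xy) and once in the denominator (through the standard deviation of the costs).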
• Recall our data from before:

$$\bar{x} = \frac{2 + 6 + 7}{3} = 5 \qquad \bar{y} = \frac{13 + 20 + 27}{3} = 20$$

              n   x   y   (x − x̄)  (x − x̄)²  (y − ȳ)  (y − ȳ)²  (x − x̄)(y − ȳ)   Answers
Data Set #1   1   2  13     -3        9        -7       49          21          s_x² = 7
              2   6  20      1        1         0        0           0          s_y² = 49
              3   7  27      2        4         7       49          14          s_xy = 17.5
Data Set #2   1   2  27     -3        9         7       49         -21          s_x² = 7
              2   6  20      1        1         0        0           0          s_y² = 49
              3   7  13      2        4        -7       49         -14          s_xy = -17.5
Data Set #3   1   2  20     -3        9         0        0           0          s_x² = 7
              2   6  27      1        1         7       49           7          s_y² = 49
              3   7  13      2        4        -7       49         -14          s_xy = -3.5

• What is the standard deviation of x? The standard deviation of y?

$$s_x = \sqrt{7} = 2.65 \qquad s_y = \sqrt{49} = 7$$

• The correlation coefficient for each data set is then:

$$r = \frac{s_{xy}}{s_x s_y}$$

Data Set #1: r = 17.5 / (2.65 × 7) = 0.943
Data Set #2: r = −17.5 / (2.65 × 7) = −0.943
Data Set #3: r = −3.5 / (2.65 × 7) = −0.189

[Figure: scatterplots of y against x for each data set, with quadrants marked + and − around the point (x̄, ȳ), annotated with the s_xy and r values above]
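The worked table can be reproduced with a short Python sketch. The helper names `sample_cov` and `sample_corr` are assumptions for illustration; the data are the three data sets from these notes. Because the code does not round s_x to 2.65 before dividing, the first two correlations come out at about ±0.945 rather than the ±0.943 obtained by hand.

```python
import math


def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_corr(x, y):
    """Correlation coefficient r = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_cov(x, x))
    s_y = math.sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (s_x * s_y)


x = [2, 6, 7]
data_sets = {
    "Data Set #1": [13, 20, 27],
    "Data Set #2": [27, 20, 13],
    "Data Set #3": [20, 27, 13],
}

# Same x and y values in every set; only the pairing (order of y) changes
results = {name: (sample_cov(x, y), sample_corr(x, y)) for name, y in data_sets.items()}

for name, (s_xy, r) in results.items():
    print(f"{name}: s_xy = {s_xy:5.1f}, r = {r:6.3f}")
```

The variances are identical across the three sets, so all of the variation in r comes from the covariance in the numerator.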
Parameters and Statistics
• For all the sample statistics we’ve defined for our sample (of size n), there’s an underlying
parameter which represents the true value
• Another way of thinking about the “truth” is that it’s what we’d get if our sample included
every observation in the entire population (n=N)
• For most statistics, we have special notation to indicate the underlying parameter...
                          Sample   Population
Mean                      x̄        µ
Variance                  s²       σ²
Standard Deviation        s        σ
Covariance                s_xy     σ_xy
Correlation Coefficient   r        ρ

It’s a common statistical convention to use Latin characters for statistics and Greek characters for parameters...
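To make the statistic/parameter distinction concrete, here is a small sketch using Python's standard `statistics` module. The population (the integers 1 to 100), the sample size of 10, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

# Hypothetical population of N = 100 values (an assumption for illustration)
population = list(range(1, 101))

mu = statistics.mean(population)           # population mean (a parameter)
sigma2 = statistics.pvariance(population)  # population variance, divides by N

rng = random.Random(310)                   # arbitrary seed so the run is repeatable
sample = rng.sample(population, 10)        # a sample of size n = 10

x_bar = statistics.mean(sample)            # sample mean (a statistic)
s2 = statistics.variance(sample)           # sample variance, divides by n - 1

# A "sample" that contains every observation (n = N) reproduces the parameter
census = rng.sample(population, len(population))
census_mean = statistics.mean(census)

print(f"mu = {mu}, x_bar = {x_bar}, census mean = {census_mean}")
print(f"sigma^2 = {sigma2}, s^2 = {s2:.2f}")
```

Note that `statistics.variance` uses the n − 1 divisor while `statistics.pvariance` uses N, which mirrors the s² versus σ² rows of the table.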
Recall
• Back in our first video we said “Statistics is a way to get information from data”
• But where does our data come from in the first place?
• Studies differ in their study design and also in the method used to collect data
• Say we’d like to study the effect of free school lunch on student absences by comparing the attendance rates of students who do and do not receive free lunch
  • Observational: Compare those who already receive free lunches with those who do not
  • Experimental: Compare students who were randomly selected into a treatment group (free lunch) with others who were randomly selected into a placebo group (paid lunch)
• Strategies for Data Collection:
  • Direct observation: Observe a classroom
  • Survey data: Design a questionnaire and interview students, teachers, or parents
• The most common strategy for collecting data is to use a survey, but there are many different types of survey to choose from:
  • Personal interview
  • Mail
  • Internet
• Each has its own advantages and disadvantages:
  • Phone, mail, and internet are obviously dramatically cheaper
  • But personal interview surveys typically have the highest response rates, as well as the fewest misunderstood questions
Simple Random Sampling
• The next step in conducting a survey is to decide on a sampling plan, which specifies how the sample will be drawn from a population
• We’ll typically assume that the respondents were selected as a simple random sample
• Example: Put all the names from the population into a hat and draw three names
• Key characteristic: Every possible sample of the same size is equally likely to be chosen
• Another way of looking at it: the identity of the person picked in the first draw (not you) has no impact on the probability you’ll be picked in a subsequent draw
• Example: An IRS employee must choose a sample of 20 out of 1,000 returns to audit
• Here I use Stata to randomly select numbers from 1 to 1000
• Note that we got a duplicate, and we can’t audit the same return twice, so I change my code to draw a 21st return.
• Note how easily we generated a list of random numbers. This wasn’t always the case…
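The lecture draws the audit sample in Stata; the sketch below is a Python alternative, not the code used in the video. It samples without replacement via `random.sample`, so no return number can appear twice and no 21st draw is needed. The seed is an arbitrary assumption that only makes the run repeatable.

```python
import random

N_RETURNS = 1000   # returns are numbered 1 to 1000
SAMPLE_SIZE = 20   # number of returns to audit

rng = random.Random(2024)  # arbitrary seed (assumption) for a repeatable draw

# Sampling without replacement: every set of 20 distinct returns is equally likely
audit = sorted(rng.sample(range(1, N_RETURNS + 1), SAMPLE_SIZE))

print(audit)
```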
Stratified Random Sampling
• Instead of a simple random sample, one could also conduct a stratified random sample
• To obtain such a sample, we begin by separating the population into mutually exclusive sets (aka strata)

Example 1     Example 2     Example 3
Gender        Age           Occupation
Male          < 20          Professional
Female        20-30         Clerical
              31-40         Blue collar
              41-50         Other
              51-60
              > 60

• After the population has been stratified, we draw a simple random sample from each stratum:

Income Category   Population Proportion   Over-sampling Proportion
Under 25K         25%                     25%
25K-39K           40%                     25%
40K-60K           30%                     25%
Over 60K          5%                      25%

• Example: If we only have sufficient resources to sample 400 people, we would draw
  • 100 from 1st group, 160 from 2nd group, 120 from 3rd group, 20 from 4th group
  • Note the mistake in the video: 5% of 400 is 20, not 40!
• Why? Well this would ensure we don’t get unlucky and that our sample is representative (at least along this one dimension)
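The allocation arithmetic can be sketched in a few lines of Python. The variable names are made up for illustration; the percentages are the ones in the income table. Percentages are kept as integers so the group sizes come out exact.

```python
TOTAL_SAMPLE = 400

# Population share of each income stratum, in percent (from the table)
population_pct = {"Under 25K": 25, "25K-39K": 40, "40K-60K": 30, "Over 60K": 5}

# Over-sampling plan: every stratum gets the same share of the sample
oversample_pct = {group: 25 for group in population_pct}

# Proportional allocation: each stratum's sample size matches its population share
proportional = {g: TOTAL_SAMPLE * pct // 100 for g, pct in population_pct.items()}
oversampled = {g: TOTAL_SAMPLE * pct // 100 for g, pct in oversample_pct.items()}

for group in population_pct:
    print(f"{group:>9}: proportional = {proportional[group]:3d}, "
          f"over-sampling = {oversampled[group]:3d}")
```

Under the over-sampling plan the smallest stratum (Over 60K) contributes 100 respondents instead of 20, which helps when that group needs to be studied on its own.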
• A cluster sample is where you randomly sample groups of respondents
  • Example: A researcher wants to interview students, so she randomly samples schools and then interviews all (or a random sample) of the students in the selected schools
  • Advantage: Can save time and money
• All the sampling plans we’ve discussed so far are random samples, but sometimes researchers work with non-probability samples
  • Example 1: Go outside and interview everyone who walks past (convenience sample)
  • Example 2: Ask respondents about their social network then interview their friends as well (snowball sample)
• We’d like the sample to have characteristics similar to those of the underlying population
• However, for any given sample that we draw, the sample statistic will very rarely be exactly equal to the underlying population parameter
• It’s useful to divide the difference between the statistic and the parameter into two parts:
  • Sampling error
  • Non-sampling error
Sampling Error
• Sampling error refers to differences between the sample and the population that exist only because of random variation in the people that happened to get selected into the sample
• If the sampling plan is solid, we expect it to yield a sample with characteristics similar to the underlying population
• But for the actual sample we drew, luck determines how close the sample statistic is to the population parameter; they won’t be exactly equal unless we’re incredibly lucky
• As we’ll see more formally in future lectures, increasing the sample size reduces this error
Non-Sampling Error
• Non-sampling error results from mistakes made in the acquisition of the data or flaws in the sampling plan
• So we are not unlucky (as with sampling error), instead we are incompetent
• In this case we can end up with a statistic that is consistently either too high or too low
• Increasing the sample size won’t reduce non-sampling error, so one can’t just throw resources at the problem to make it go away...
• Some sources of non-sampling error are: Faulty data, non-response, and selection bias
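The contrast between the two kinds of error can be illustrated with a small simulation. Everything in this sketch is an assumption for illustration: the population of absence counts is made up, and the "flawed plan" is a stylized non-response pattern in which students with more than 10 absences never answer. It is a sketch of the idea, not a result from the lecture.

```python
import random
import statistics

rng = random.Random(310)  # arbitrary seed so the run is repeatable

# Hypothetical population: yearly absences for 10,000 students (made-up data)
population = [rng.randint(0, 20) for _ in range(10_000)]
mu = statistics.mean(population)

# Flawed plan (assumed): students with more than 10 absences never respond
responders = [a for a in population if a <= 10]


def avg_abs_error(pool, n, reps=200):
    """Average |sample mean - mu| over repeated random samples of size n."""
    errors = [abs(statistics.mean(rng.sample(pool, n)) - mu) for _ in range(reps)]
    return statistics.mean(errors)


sizes = (10, 100, 1000)
srs_error = {n: avg_abs_error(population, n) for n in sizes}
nonresponse_error = {n: avg_abs_error(responders, n) for n in sizes}

for n in sizes:
    print(f"n = {n:4d}: simple random sample error = {srs_error[n]:.2f}, "
          f"with non-response = {nonresponse_error[n]:.2f}")
```

With a sound sampling plan the average error should shrink as n grows, while under the non-response pattern the sample mean stays well below the population mean no matter how large the sample is.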
Never 70%
So of those responding:
At least once 0% Never 100%