Ec310 Day 3 Lecture Notes

Econ 310
Statistics – Measurement in Economics
Day 3 – Video 1
Numerical Descriptive Techniques
Measures of Linear Relationship
(Also see Keller Chapter 4-4)

Measures of Linear Relationship
• We want a way to assess whether there’s a linear relationship between two variables
• What are our options?
• We’ll cover two today, and then a third in the final weeks of the class

Option 1: Covariance
• The sample covariance between x and y is:

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

  (x̄ is the sample mean of x; ȳ is the sample mean of y)
• Note that the sample covariance between x and x is:

$$s_{xx} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{n - 1} = s_x^2$$

  (This should look familiar!)

Option 1: Covariance
• In the same way as there was a “shortcut” for the sample variance, there’s also a shortcut for the sample covariance:

$$s_{xy} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$$

• (Once again, when y = x this simplifies to the same equation we had before for the variance)

Option 1: Covariance
• Consider the following three sets of data:

Data set   n   x   y
Set #1     1   2   13
           2   6   20
           3   7   27
Set #2     1   2   27
           2   6   20
           3   7   13
Set #3     1   2   20
           2   6   27
           3   7   13

  Columns to fill in for each row: (x - x̄), (x - x̄)², (y - ȳ), (y - ȳ)², (x - x̄)(y - ȳ)
  Answers to find for each data set: s²x, s²y, sxy
• In each set, the values of X and Y are the same; the only difference is the order of the Y’s
• Let’s see how this affects the variance and covariance calculations…

x̄ = (2 + 6 + 7)/3 = 5
ȳ = (13 + 20 + 27)/3 = 20
sxy = Σ (xi - x̄)(yi - ȳ) / (n - 1)

[Figure: scatter plots of Data Sets #1, #2, and #3 (x on the horizontal axis, y on the vertical axis), each divided into quadrants by lines at x̄ and ȳ; quadrants are marked + or – according to the sign of (x - x̄)(y - ȳ). Set #1: sxy = 17.5. Set #2: sxy = -17.5. Set #3: sxy = -3.5.]
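
The calculation for the three data sets can be checked with a few lines of Python. This is only a sketch: the function names `sample_covariance` and `sample_covariance_shortcut` are made up for illustration, and the code simply applies the definition and the shortcut formula to the x and y values listed in the table.

```python
def sample_covariance(xs, ys):
    """Definition: sum of cross-deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_covariance_shortcut(xs, ys):
    """Shortcut: (sum of x*y minus (sum x)(sum y)/n), divided by n - 1."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / (n - 1)


# Same x values in every set; only the order of the y values changes.
x = [2, 6, 7]
data_sets = {
    "Set #1": [13, 20, 27],
    "Set #2": [27, 20, 13],
    "Set #3": [20, 27, 13],
}

results = {}
for name, y in data_sets.items():
    results[name] = {
        "s2x": sample_covariance(x, x),  # covariance of x with itself is the variance
        "s2y": sample_covariance(y, y),
        "sxy": sample_covariance(x, y),
        "sxy_shortcut": sample_covariance_shortcut(x, y),
    }
    print(name, results[name])
```

The variances are identical across the three sets because reordering the y values does not change their spread; only the covariance depends on how the x and y values are paired.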

Option 1: Covariance
• So the covariance is:
  • Positive when variables move in the same direction
  • Negative when variables move in opposite directions
  • “Close to zero” when there is no particular pattern
• One problem: what values should we consider close to zero?
  • In the last case, -3.5 was close to zero and 17.5 was not
  • So maybe a cutoff of 5 would serve us well? Or maybe 10?
• Consider an example...

Option 1: Covariance
• Example: Correlation between output and electrical costs in a small tool shop
• We have electrical costs measured two ways: in dollars and in cents
• Covariance between electrical costs (in dollars) and tools produced is 36.06
• Covariance between electrical costs (in cents) and tools produced is 3606
• Is the second pair of variables more closely related?
• This brings us to option 2...
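
A short derivation from the covariance definition suggests why the two tool shop numbers differ by exactly a factor of 100. Here x stands for electrical costs in dollars and x' = 100x for the same costs in cents; the symbol x' is introduced only for this illustration. Because the mean of x' is 100x̄, the factor of 100 pulls straight out of the sum:

```latex
\begin{align*}
s_{x'y} &= \frac{\sum_{i=1}^{n} (100x_i - 100\bar{x})(y_i - \bar{y})}{n - 1} \\
        &= 100 \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \\
        &= 100\, s_{xy}
\end{align*}
```

With s_xy = 36.06 this gives 3606, which matches the two figures quoted above. The larger number reflects the change of units, not a closer relationship.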

Option 2: Correlation Coefficient
• The correlation coefficient (aka the coefficient of correlation) is defined as the covariance divided by the standard deviations of the variables
• The sample correlation coefficient is:

$$r = \frac{s_{xy}}{s_x s_y}$$

• What’s a large correlation coefficient? Small?

  r = +1: Perfect positive linear relationship
  r = 0: No linear relationship whatsoever
  r = -1: Perfect negative linear relationship

Option 2: Correlation Coefficient
• Will r be positive, negative, or close to zero?
[Figure: example scatter plots]

Option 2: Correlation Coefficient
• Which of these has a higher r?
[Figure: two pairs of scatter plots to compare]

Option 2: Correlation Coefficient
• So the correlation coefficient has one major advantage over the covariance:
  • It’s a unit-free measure
  • As a result, it ranges from -1 to +1 with a straightforward interpretation
  • So it yields a quick answer to the question: How strong is the association between x & y?
• But as we’ve seen, we must keep a few caveats in mind:
  • It is only a measure of the linear association
  • It measures the extent to which the variables are related, not the slope
  • When we want a measure of the slope, we’ll turn to regression analysis (Keller introduces regression in Chapter 4, but we’ll wait and tackle it in more depth near the end of the term)

Option 2: Correlation Coefficient
• Example: Let’s return to our tool shop example
• We saw earlier that the units had a huge impact on the covariance
• What about the correlation?
• Correlation between electrical costs (in dollars) and tools produced is 0.87
• Correlation between electrical costs (in cents) and tools produced is 0.87
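
The unit-free property can be illustrated with a small Python sketch. The numbers below are HYPOTHETICAL and are not the tool shop data from the lecture (which is not reproduced in these notes); the function names are also made up for illustration. The point is only that converting dollars to cents multiplies the covariance by 100 while leaving r unchanged.

```python
import math


def sample_covariance(xs, ys):
    """Sum of cross-deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# HYPOTHETICAL data, invented for this sketch.
tools = [3, 5, 8, 11, 15]
cost_dollars = [24.0, 31.0, 40.0, 47.0, 62.0]
cost_cents = [100 * c for c in cost_dollars]  # same costs, different units

cov_dollars = sample_covariance(cost_dollars, tools)
cov_cents = sample_covariance(cost_cents, tools)
r_dollars = sample_correlation(cost_dollars, tools)
r_cents = sample_correlation(cost_cents, tools)

print("covariance (dollars):", cov_dollars)
print("covariance (cents):  ", cov_cents)
print("correlation (dollars):", r_dollars)
print("correlation (cents):  ", r_cents)
```

The standard deviation of the costs also scales by 100 when the units change, so the factor cancels in the ratio that defines r.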

Option 2: Correlation Coefficient
• Recall our data from before:

Data set  n  x   y   (x - x̄)  (x - x̄)²  (y - ȳ)  (y - ȳ)²  (x - x̄)(y - ȳ)  Answers
Set #1    1  2   13    -3        9        -7       49         21           s²x = 7
          2  6   20     1        1         0        0          0           s²y = 49
          3  7   27     2        4         7       49         14           sxy = 17.5
Set #2    1  2   27    -3        9         7       49        -21           s²x = 7
          2  6   20     1        1         0        0          0           s²y = 49
          3  7   13     2        4        -7       49        -14           sxy = -17.5
Set #3    1  2   20    -3        9         0        0          0           s²x = 7
          2  6   27     1        1         7       49          7           s²y = 49
          3  7   13     2        4        -7       49        -14           sxy = -3.5

• What is the standard deviation of x? The standard deviation of y?
  sx = √7 = 2.65
  sy = √49 = 7

x̄ = (2 + 6 + 7)/3 = 5
ȳ = (13 + 20 + 27)/3 = 20
r = sxy / (sx sy), with sx = 2.65 and sy = 7

[Figure: the three scatter plots from before, with quadrants marked + or – around x̄ and ȳ.]
Set #1: sxy = 17.5, r = 17.5 / (2.65 × 7) = 0.943
Set #2: sxy = -17.5, r = -17.5 / (2.65 × 7) = -0.943
Set #3: sxy = -3.5, r = -3.5 / (2.65 × 7) = -0.189
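
The correlation coefficients for the three data sets can be reproduced with a short Python sketch (function names are illustrative, not from the lecture). One small caveat: the slides round sx to 2.65 before dividing, which gives 0.943; carrying full precision for √7 gives roughly 0.945 instead. The value for Set #3 rounds to -0.189 either way.

```python
import math


def sample_covariance(xs, ys):
    """Sum of cross-deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r = s_xy / (s_x * s_y), with no intermediate rounding."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


x = [2, 6, 7]
data_sets = {
    "Set #1": [13, 20, 27],
    "Set #2": [27, 20, 13],
    "Set #3": [20, 27, 13],
}

correlations = {name: sample_correlation(x, y) for name, y in data_sets.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Sets #1 and #2 have correlations of equal size and opposite sign, which mirrors the covariances of 17.5 and -17.5 in the table.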
Parameters and Statistics
• For all the sample statistics we’ve defined for our sample (of size n), there’s an underlying
parameter which represents the true value
• Another way of thinking about the “truth” is that it’s what we’d get if our sample included
every observation in the entire population (n=N)
• For most statistics, we have special notation to indicate the underlying parameter...

                          Sample   Population
Mean                      x̄        µ
Variance                  s²       σ²
Standard Deviation        s        σ
Covariance                sxy      σxy
Correlation Coefficient   r        ρ

It’s a common statistical convention to use Latin characters for statistics and Greek characters for parameters...

Econ 310
Statistics – Measurement in Economics
Day 3 – Video 2
Data Collection and Sampling Plans
(Also see Keller Chapter 5)

Recall
• Back in our first video we said “Statistics is a way to get information from data”
• But where does our data come from in the first place?
• In particular, we’d like to explore the following questions:
  • How is data gathered?
  • Given this, do we expect it to be accurate and reliable?
  • And, in particular, do we expect it to be representative of the underlying population?

Methods of Collecting Data
• Studies differ in their study design and also in the method used to collect data
• Say we’d like to study the effect of free school lunch on student absences by comparing the attendance rates of students who do and do not receive free lunch
• Strategies for getting variation:
  • Observational: Compare those who already receive free lunches with those who do not
  • Experimental: Compare students who were randomly selected into a treatment group (free lunch) with others who were randomly selected into a placebo group (paid lunch)
• Strategies for Data Collection:
  • Direct observation: Observe a classroom
  • Survey data: Design a questionnaire and interview students, teachers, or parents
  • Administrative data: Get suspension statistics from school records

Surveys
• The most common strategy for collecting data is to use a survey, but there are many different types of survey to choose from:
  • Personal interview
  • Phone
  • Mail
  • Internet
• Each has its own advantages and disadvantages:
  • Phone, mail, and internet are obviously dramatically cheaper
  • But personal interview surveys typically have the highest response rates, as well as the lowest rates of misunderstood questions

Simple Random Sampling
• The next step in conducting a survey is to decide on a sampling plan, which specifies how the sample will be drawn from a population
• We’ll typically assume that the respondents were selected as a simple random sample
  • Example: Put all the names from the population into a hat and draw three names
  • Key characteristic: Every possible sample of the same size is equally likely to be chosen
  • Another way of looking at it: the identity of the person picked in the first draw (not you) has no impact on the probability you’ll be picked in a subsequent draw
• Of course, we don’t normally use a hat…

Simple Random Sampling
• Example: An IRS employee must choose a sample of 20 out of 1,000 returns to audit
• Here I use Stata to randomly select numbers from 1 to 1000
• Note that we got a duplicate, and we can’t audit the same return twice, so I change my code to draw a 21st return.
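
The lecture does this draw in Stata; the sketch below shows one way the same idea might look in Python, and is not a translation of the instructor’s code. It uses `random.sample`, which draws without replacement, so duplicates cannot occur and no 21st draw is needed. The seed value 310 is arbitrary and is there only to make the run repeatable.

```python
import random

N_RETURNS = 1000   # returns are numbered 1 to 1000
SAMPLE_SIZE = 20   # number of returns to audit

# Seeded generator so the same sample comes out on every run (seed is arbitrary).
rng = random.Random(310)

# random.sample draws WITHOUT replacement: every set of 20 distinct returns
# is equally likely, and no return can be picked twice.
audit_sample = rng.sample(range(1, N_RETURNS + 1), SAMPLE_SIZE)

print(sorted(audit_sample))
```

Drawing with replacement and then topping up after a duplicate, as on the slide, also ends with 20 distinct returns; sampling without replacement simply avoids the extra step.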

Simple Random Sampling
• Note how easily we generated a list of random numbers. This wasn’t always the case…

Simple Random Sampling
[Image: a published book of random numbers]
The Amazon Reviews for this book are worth checking out...
• Great book; should be released as an audiobook!
• Riddled with typos
• Would be easier to use if they sorted the numbers!

Stratified Random Sampling
• Instead of a simple random sample, one could also conduct a stratified random sample
• To obtain such a sample, we begin by separating the population into mutually exclusive sets (aka strata)

Example 1   Example 2   Example 3
Gender      Age         Occupation
Male        < 20        Professional
Female      20-30       Clerical
            31-40       Blue collar
            41-50       Other
            51-60
            > 60

Stratified Random Sampling
• After the population has been stratified, we draw a simple random sample from each stratum:

Income Category   Population Proportion   Over-sampling Proportion
Under 25K         25%                     25%
25K-39K           40%                     25%
40K-60K           30%                     25%
Over 60K          5%                      25%

• Example: If we only have sufficient resources to sample 400 people, we would draw
  • 100 from 1st group, 160 from 2nd group, 120 from 3rd group, 20 from 4th group
  • (Note the mistake in the video: 5% of 400 is 20, not 40!)
• Why? Well, this would ensure we don’t get unlucky and that our sample is representative (at least along this one dimension)
• But the more common use of stratified sampling is to over-sample subpopulations, as illustrated in the final column of the table…
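
The allocation arithmetic can be written out as a small Python sketch. The helper name `allocate` is made up for illustration; the percentages and the total of 400 come from the table and example above.

```python
TOTAL_SAMPLE = 400

# Percentages from the income table (whole numbers keep the arithmetic exact).
population_pct = {"Under 25K": 25, "25K-39K": 40, "40K-60K": 30, "Over 60K": 5}
oversampling_pct = {"Under 25K": 25, "25K-39K": 25, "40K-60K": 25, "Over 60K": 25}


def allocate(total, pct_by_stratum):
    """Split a total sample size across strata according to the given percentages."""
    return {stratum: total * pct // 100 for stratum, pct in pct_by_stratum.items()}


proportional = allocate(TOTAL_SAMPLE, population_pct)
oversampled = allocate(TOTAL_SAMPLE, oversampling_pct)

for stratum in population_pct:
    print(f"{stratum:10s} proportional: {proportional[stratum]:3d}   "
          f"over-sampled: {oversampled[stratum]:3d}")
```

Under the over-sampling plan the smallest group (Over 60K) contributes 100 respondents instead of 20, which is the kind of boost for a small subpopulation that the final column of the table illustrates.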

Other sampling methods
• A cluster sample is where you randomly sample groups of respondents
  • Example: A researcher wants to interview students, so she randomly samples schools and then interviews all (or a random sample) of the students in the selected schools
  • Advantage: Can save time and money
  • Disadvantage: Typically results in less precise estimates
• All the sampling plans we’ve discussed so far are random samples; sometimes researchers work with non-probability samples
  • Example 1: Go outside and interview everyone who walks past (convenience sample)
  • Example 2: Ask respondents about their social network then interview their friends as well (snowball sample)
• Why would anyone opt for these sampling plans?

Sampling and Non-Sampling Errors
• We’d like the sample to have characteristics similar to those of the underlying population
• However, for any given sample that we draw, the sample statistic will very rarely be exactly equal to the underlying population parameter
• It’s useful to divide the difference between the statistic and the parameter into two parts:
  • Sampling error
  • Non-sampling error

Sampling Error
• Sampling error refers to differences between the sample and the population that exist only because of random variation in the people that happened to get selected into the sample
• If the sampling plan is solid, we expect it to yield a sample with characteristics similar to the underlying population
• But for the actual sample we drew, luck determines how close the sample statistic is to the population parameter; they won’t be exactly equal unless we’re incredibly lucky
• As we’ll see more formally in future lectures, increasing the sample size reduces this error

Non-Sampling Error
• Non-sampling error results from mistakes made in the acquisition of the data or flaws in the sampling plan
• So we are not unlucky (as with sampling error); instead we are incompetent
• In this case we can end up with a statistic that is consistently either too high or too low
• Increasing the sample size won’t reduce non-sampling error, so one can’t just throw resources at the problem to make it go away...
• Some sources of non-sampling error are: Faulty data, non-response, and selection bias
• Example: I want to know how prevalent marijuana use is in Wisconsin, so I asked my class last fall how many of them have ever smoked a joint (show of hands):

Response        Share of class
Never           70%
At least once   0%
No Response     30%

So of those responding:
Never           100%
At least once   0%
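
A small simulation can make the contrast between the two kinds of error concrete. Everything in it is an ASSUMPTION made for illustration: the true share of 30%, the rule that everyone who has ever used stays silent, and all function names are hypothetical, since the lecture does not say what the true rate in the class was. The sketch only shows the pattern described above: the honest estimate gets closer to the truth as n grows, while the estimate based on responders stays wrong at every sample size.

```python
import random

rng = random.Random(310)   # arbitrary seed, for repeatability
TRUE_SHARE = 0.30          # HYPOTHETICAL true share who have ever used


def draw_sample(n):
    """Simple random sample of n people; True means 'has used at least once'."""
    return [rng.random() < TRUE_SHARE for _ in range(n)]


def share_if_everyone_answers(sample):
    """Estimate with no non-response: only sampling error remains."""
    return sum(sample) / len(sample)


def share_among_responders(sample):
    """ASSUMED non-response rule: anyone who has used declines to answer."""
    responders = [used for used in sample if not used]
    return sum(responders) / len(responders)


results = []
for n in (100, 10_000, 100_000):
    sample = draw_sample(n)
    honest = share_if_everyone_answers(sample)
    biased = share_among_responders(sample)
    results.append((n, honest, biased))
    print(f"n = {n:>7}: everyone answers -> {honest:.3f}, responders only -> {biased:.3f}")
```

Under these assumptions the responders-only estimate is 0% no matter how many people are surveyed, which is the sense in which throwing resources at the sample size does not fix non-sampling error.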
