Applied Regression Analysis
Christer Thrane
Contents
Acknowledgements
PART 1
The basics
Appendix A: A short introduction to Stata’s point-and-click routine
Appendix B: A short introduction to the point-and-click routine in SPSS
PART 2
The foundations
Exercises with data and solutions
Informal glossary to Part 2
PART 3
The extensions
8.7 Final thoughts and further reading
Notes
Exercises with data and solutions
PART 4
Regression purposes, academic regression projects and the way ahead
References
Subject index
Acknowledgements
Thanks to Andy Humphries for the opportunity to write this textbook for
Routledge. Thanks also to three reviewers for very valuable suggestions on
the first draft of the book, and to Henning Bueie for initially helping me with
some of the book’s figures.
I published a textbook in Norwegian on regression analysis in 2017. Some
of the material in the present text, especially in Chapters 1 and 2, draws heavily
on that book. Also, some of the examples in the present text are adaptations
and extensions from my 2017 textbook. It goes without saying that all those
contributing to my 2017 textbook also deserve a thank you in regard to the
present book.
Lillehammer, Norway, August 24, 2019
Christer Thrane
Part 1
The basics
• What regression analysis is, how it works and (hopefully) what it can do
for topics related to your own field of study
• How to do (or as we often say: run) a regression analysis using real data
on your own computer
• How to interpret the regression output from statistical programs regarding
what happens in your data
1
What is regression analysis?
Figure 1.1 House sale prices by house sizes. 100 houses sold in a small Norwegian city
during 2017.
Figure 1.2 House sale prices by house sizes, with regression line. 100 houses sold in a
small Norwegian city during 2017.
1.2 What this book is about and why it has been written
This book is about the statistical method called regression analysis. Or just
regression. Regression is the work horse among the statistical methods used in
the economic and social sciences. For this reason, there are lots of textbooks
around, which, in turn, raises a question: What does this book have to offer that
others don’t? The short answer is that this book, contrary to others, is built on
the “wicked” idea that one does not need to know much about mathematics
and statistics to understand regression analysis and to perform it to perfec-
tion. Five more detailed answers are:
or “trait” – we write and speak of variables. The size of houses varies among
houses; it is therefore a variable. The sale price of houses also varies among
houses; it is yet another variable. An obvious rule of thumb: A variable is
something that varies. Also, we make a distinction between independent and
dependent variables. An independent variable is something that possibly
brings about statistical variation in another variable; the size of houses is thus
the independent variable in our example.1 (The notes in the book are placed
at the end of each main chapter.) The dependent variable is the variable we
are interested in explaining, i.e. something that possibly is being affected by
another, independent variable. Thus, the sale price of houses is the dependent
variable in our example. But since dependent and independent variables are
long and heavy names in terms of usage, we often opt for y as shorthand for
the former and x as shorthand for the latter. Figure 1.3 illustrates.
Other popular names for the dependent variable are outcome variable,
response variable, criterion variable and regressand. Similarly, for the inde-
pendent variable, we have explanatory variable, predictor variable, treatment
variable, covariate and regressor.
Units or observations are key concepts when doing (or running) a regres-
sion analysis. The units or observations in our example are houses. More
typically, the units are persons in the economic and social sciences, but they
could be almost anything: firms, countries, products, stocks etc. When people
who have answered a questionnaire are the units, we call them respondents
or subjects. More formally, the units/observations are those we have data or
variable information about.
To actually do a regression analysis we need two things. First, we need
data, i.e. variable information on a set of units. The data sets used in this
book are available for download from the book’s website. Second, we need
access to a computer with statistical software. In this book I will use Stata.
Figure 1.3 The independent variable, x, and the dependent variable, y.
• The statistical relationship between performance (x) and pay (y) for the
soccer players in the Premier League, if any?
• The statistical relationship between innovation activity (x) and revenue
(y) for 150 firms in the service industry, if any?
• The statistical relationship between the quality of wines (x) and their
price (y) for 275 bottles of red wine, if any?
• The statistical relationship between class size (x) and class-averaged
grades (y) for 180 high school classes, if any?
• The statistical relationship between total hours of exercise per week (x)
and self-assessed health (y) for 350 women aged 40 to 50 years, if any?
• The statistical relationship between number of days at a travel destination
(x) and total trip expenditures (y) for 400 travelers, if any?
Regression fits the bill in all of the above examples. That is, regression analysis
examines if and how an independent variable, x, covaries or correlates with a
dependent variable, y, for a given set of units/observations.2
In doing so, I will eschew symbols, equations and formulas to the greatest
possible extent and instead adopt the example-driven show-and-don’t-tell
approach.
The remainder of Part 1 of the book is organized as follows. Chapter 2
explains and illustrates so-called bivariate or simple regression analysis in all
its forms, i.e. a regression with only one independent variable. We start by
more accurately describing the variable relationship mentioned in the opening
of the book. This chapter is the stepping-stone for the rest of the book.
Chapter 3 extends bivariate regression by incorporating several independent
variables into the regression simultaneously. This is called multiple regression,
and we will cross that bridge when we get to it.
Notes
1 I use the phrase “…something that possibly brings about statistical variation
in another variable” with intention, to emphasize that we are dealing with non-
deterministic statistical relationships (i.e. associations, correlations, covariations)
and not necessarily causal relationships (i.e. cause-and-effect relationships). If not
stated otherwise, the examples in the book are about statistical associations and not
causal relationships. I will say more on using regression analysis to reveal causal
relationships in Chapters 9 and 10 and sections 11.4 and 11.6.
2 In Chapter 3 we will incorporate several independent variables simultaneously in a
regression analysis.
3 For this reason, the book does not start with a crash course in statistics. In my
experience this is a waste of time for (the many) students who fear statistics or
math. For those who do not (the few), it is just redundant. Neither do I devote
a chapter to variables and their measurement levels; nothing about measurement
levels is required knowledge to get a first, firm grip on regression. But if you still
want to know or could do with a recap, the appendix to this chapter offers a
primer on variables and their measurement levels and how this relates to regression
analysis.
Numerical variables The sale price of houses, hours of exercising per week,
annual earnings and size of budget are numerical variables in the sense
that they are interval-scaled. That is, their values are numbers with specific
distances between them; a house costs, for example, 100 000 €, 200 000 €,
300 000 € and so on, and the distance between 100 000 € and 200 000 € is the
same as the distance between 300 000 € and 400 000 €. (Note also that 340 000 €
is twice as much as 170 000 €.) In the same vein, someone who exercises for
six hours per week exercises three hours more than (or twice as much as) one
who exercises for three hours per week.
Categorical variables Put bluntly, there is no natural (interval) scale for a strict
categorical variable; a person is either male or female, a firm has either a female
or a male CEO and a company produces either services or physical products
and sometimes both. That is, there are no low or high values for strict categor-
ical variables; they are unordered with an either/or logic to them. The variable
gender has two unordered values (male or female); the variable season has
four unordered values (spring, summer, autumn or winter). In statistical ana-
lysis we still assign numbers to these categorical states, such as 1 for woman
and 0 for man or vice versa, 0 to 3 for the four seasons and 1 for yes and 0 for
no. But we do not attach any low or high interpretations to these numbers.
to “1–2 times per week” is not the same as the distance from “3–4 times per
week” to “5 times per week or more”. For more on variables and their meas-
urement levels, see for example Agresti and Finlay (2014).
2
Linear regression with a single
independent variable
Figure 2.1 House sale prices by house sizes, with regression line. 100 houses sold in a
small Norwegian city during 2017.
Graph generated in Stata.
to get the results in the output. As you see, this command reappears above
the results. The Courier font is used for commands and variable names
throughout the book. For those familiar with a point-and-click routine to
do statistical analysis (like in SPSS; cf. Appendix B to this chapter), Stata
also offers this capability; see Appendix A to this chapter or Stata’s YouTube
channel www.youtube.com/channel/UCVk4G4nEtBS4tLOyHqustDA.
Stata’s internal help menu system also provides a wealth of valuable infor-
mation and resources for the newbie.
Before turning to the practicalities of regression, let’s stop and go through
the preliminaries in Stata-output 2.1. We have 100 observations (houses) for
both variables; see under the heading “Obs”. The mean or average sale price
is almost 443 000 €, and the mean house size is almost 141 m²; see under the
heading “Mean”. The standard deviation (“Std. Dev.”) – which we can
think of as the average deviation from the mean – is about 141 000 for sale
price and 39 for square meters.2 The minimum (“Min”) and maximum (“Max”)
refer to the smallest and largest value of each variable.
Stata-output 2.1 Descriptive statistics for the variables price_Euro and square_m.
Stata-output 2.2 Regression of house sale prices (price_Euro) on house sizes
(square_m).
Where does the regression line cross the y-axis? Stata estimates this to be at 73
130 €, i.e. the number below the regression coefficient. In Stata this figure is called
“_cons”. There are many other numbers of interest in the output, but they need
not concern us now. We return to some of them in Chapters 3 and 4. (The output
from statistics programs generally includes more than we really need!)
Do-files in Stata. When doing regression (as well as statistical analysis in general),
I always encourage students to write down the commands they use. This is smart
for two reasons. First, you do not need to memorize all commands, i.e. you may
copy and paste a lot. Second, it makes it easier to re-run a regression a year later to
obtain the exact same results. The outputs from Stata presented so far in this book
are based on the do-file (i.e. Stata’s name for a command file) in Stata-output 2.3.
Stata-output 2.3 Stata do-file generating the output from Stata in this book so far.
Short explanation of Stata do-file. Line 1 tells Stata that I am using Stata version
15.0 in case something in the command structure changes in later versions; lines
2 and 3 are useful to include in all do-files for reasons we do not go into. Line 5
starts with //, which means that Stata neglects what is written on this line. In
other words, such lines are where we write notes to ourselves. Line 6 tells Stata
where to find the data, which on my PC is at “C:\Users\700245\Desktop\
temp_myd\housedata.dta”. (On your computer this path, of course, will
be different.) Line 9 yields the descriptive statistics in Stata-output 2.1, whereas
line 12 yields the regression in Stata-output 2.2. Line 15 tells Stata to generate
(“gen” for short) the variable SP_mills (sale price in millions), and this vari-
able is then used to create the two graphs in the book (lines 17/18 and 22/23). The
sign /* at the end of the line tells Stata to jump to the next line starting with */.
The R program. R is gaining popularity among economists and social scientists.
For this reason and the fact that the program is free of charge, the initial analyses
and results in the book will also be presented in R. The commands will be written
in the more user-friendly add-on to R called RStudio (www.rstudio.com/).
R is a purely command-driven program, and it thus has some common ground
with writing do-files in Stata, although the command language is a bit different.
R-output 2.1 illustrates. Note that the data file name ends with the suffix “.csv”.
This is the preferred (spreadsheet) data format used for R in this book.
R-output 2.1 R command-file generating the regression output from Stata-output 2.2,
descriptive statistics (Stata-output 2.1) and a regression plot similar to
Figure 2.1.
Short explanation of R command-file. In line 1, # tells R to neglect what is
written on this line. Lines 2 and 3 find the data on my computer. Lines 6 and
7 define the variables for the upcoming regression. Line 10 is the regression
command (~ is called a tilde), and line 11 tells R to show the results of this
regression at the bottom of the output. The regression coefficient is 2 630
€ (under the heading “Estimate”) and the regression line crosses the y-
axis at 73 130 € (the number right above the regression coefficient). In R this
figure is called the “Intercept”. Lines 14, 15 and 16 produce the graph
in Figure 2.2. Lines 19 and 20 are the commands for generating descriptive
statistics for the two variables used in the regression analysis.
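Pulling that description together, here is a rough sketch of what such a command file may look like (the csv file name is an assumption, and the line positions may differ slightly from the actual R-output 2.1):

# Regression of house sale prices on house sizes (cf. Stata-outputs 2.1 and 2.2)
housedata <- read.csv("housedata.csv")   # assumed file name; your path will differ

# Define the variables for the upcoming regression
price_Euro <- housedata$price_Euro
square_m <- housedata$square_m

# Run the regression (~ separates y from x) and show the results
reg1 <- lm(price_Euro ~ square_m)
summary(reg1)

# Scatterplot with the fitted regression line (cf. Figure 2.2)
plot(square_m, price_Euro / 1000000,
     xlab = "Square meters", ylab = "Sale price in million euros")
abline(lm(price_Euro / 1000000 ~ square_m))

# Descriptive statistics for the two variables in the regression
summary(price_Euro); sd(price_Euro)
summary(square_m); sd(square_m)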
A related example. We have seen a positive relationship between house size
and sale price; is there also an association between the price of houses and where
they are situated? The housing data contains a variable measuring the distance
from city center to where the house is located (dist_city_c). This vari-
able measures the distance in question in kilometers (km), and the descriptive
statistics are presented in Stata-output 2.4.
Note that the average house is situated slightly more than 3 km from the
city center, whereas the closest one is located only 600 meters (0.6 km) from
it. How is this distance-variable related to sale price? A regression analysis
answers this question, and the results appear in Stata-output 2.5.
Figure 2.2 House sale prices by house sizes, with regression line. 100 houses sold in a
small Norwegian city during 2017.
Graph generated in R (see R-output 2.1).
Stata-output 2.4 Descriptive statistics for the variable dist_city_c.
The regression coefficient is about –25 000 €. This implies that a house
located 5 km from the city center on average costs 25 000 € less than a house
situated 4 km from the city center (i.e. a static interpretation; cf. note 4). More
generally, we have that the further away from the city center a house is located,
the cheaper it tends to be, relatively speaking.
Stata do-file. In terms of obtaining the above results with the Stata do-file (see
Stata-output 2.3), the only new code to write is “sum dist_city_c” on
one line (descriptive statistics) and “regress price_Euro dist_city_c”
(regression command) on the next line. We’ll skip that.
Stata-output 2.5 Regression of house sale prices (price_Euro) on distance to city
center (dist_city_c).
Figure 2.3 House sale prices by distance to city center, with regression line. 100 houses
sold in a small Norwegian city during 2017.
Graph generated in Stata.
R command-file. To get the results above with the R command-file (see R-
output 2.1), you need to replace square_m with dist_city_c in line 7
(two times) and lines 10 and 20. We’ll skip that also.
Figure 2.3 is a plot of the above regression and demonstrates the downward-
sloping regression line.5 (The only new feature in the do/command-files is
that dist_city_c replaces square_m in the graph commands.)
Two key insights emerge from the analyses and figures above: (1) A positive
regression coefficient implies an upward-sloping regression line, whereas (2) a
negative regression coefficient means a downward-sloping regression line.
A positive regression coefficient indicates a positive correlation or association
between x and y; a negative regression coefficient suggests a negative correl-
ation or association between x and y. A regression coefficient of 0, or zero
correlation/association, implies an exact horizontal regression line, i.e. that x
is not statistically related to/associated with y.6
If the above analyses and comments were not too hard to grasp, I can
assure you that you have now gotten the gist of regression analysis. In section 2.2
we formalize the ideas behind regression a bit more, following up on the rela-
tionship between the size and sale price of houses.
Figure 2.4 The regression line, the constant (b0) and the regression coefficient/
slope (b1).
The renovation variable has two distinct values, 0 and 1. The general term
for this binary variable is a dummy variable or dummy. Recall the general
regression equation y = b0 + b1x from section 2.2, and assume now that x is
the renovation variable. Then we have (approximately) that

sale price = 406 000 + 77 300x.
When the independent variable is a dummy, the constant/b0 has a clear inter-
pretation. We see this by plugging the two possible values for x (0 and 1) into
the equation. For non-renovated houses (x = 0) this yields

sale price = 406 000 + 77 300 × 0 → sale price = 406 000 €.
That is, the constant/b0 refers to the mean price of houses not renovated. Here,
such houses cost 406 000 € on average.
In general, the constant/b0 refers to x = 0, i.e. where the regression line
crosses the y-axis (cf. Figure 2.4). The similar calculation or prediction for
renovated houses (x = 1) yields
sale price = 406 000 + 77 300 × 1 → sale price = 406 000 + 77 300 →
sale price = 483 300 €.
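In R, this dummy regression is a one-liner (a sketch, assuming the renovate dummy is part of the housedata frame from the earlier examples):

# Sale price regressed on the renovation dummy (0 = not renovated, 1 = renovated)
reg_dummy <- lm(price_Euro ~ renovate, data = housedata)
coef(reg_dummy)
# The intercept is the mean price of non-renovated houses (about 406 000 €);
# intercept + coefficient is the mean price of renovated houses (about 483 300 €)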
As before, only our imagination sets the boundaries on what might be the
dependent variable, y, and the independent dummy, x, within a regression
framework. And on that note it is time for some new research questions on
new data.
Figure 2.5 Earnings per year by player performance, with regression line. 242 players
in the Norwegian version of the PL.
Graph generated in Stata.
Taking stock, the two regressions above suggest that performance pays
off. That is, players with better performance and international players (who
presumably play for their national teams owing to better prior performance)
make more money on average than worse-performing and non-international
players.13
In terms of code for do-files in Stata or command-files in R, the above
analyses of soccer players’ earnings require nothing that is not covered in
Stata-output 2.3 or in R-output 2.1. The only things changing are the variable
names and the name of the data file.
The 293 female students on average exercise for about 3 hours per week.
The exercise interest index has a mean score of 4.53 on the scale from 0 to 6,
whereas 43% of the female students are fitness center members (and 57% are
not). The regression between hours of exercise and fitness center member-
ship is shown in Stata-output 2.11; the analogous regression between hours
of exercise and exercise interest appears in Stata-output 2.12.
Stata-output 2.11 says that female members of fitness centers on average
exercise for 2.23 more hours per week than female non-members (b1), whereas
the constant (b0) implies that female non-members exercise for a little more
than 2 hours per week on average.
Stata-output 2.12 shows that female students scoring 5 on the interest index
exercise for 1.37 more hours per week on average than female students scoring
4 on this index (b1). In summary, therefore, membership of fitness centers
and exercise interest are both positively associated with female students’ amount
of weekly exercise hours.
The above analyses of female students’ exercise hours require nothing that
is not covered in Stata-output 2.3 or in R-output 2.1 in terms of code for do-
files/command files. Again, the only things that have changed are the variable
names and the name of the data file.
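As an illustration, the R versions could be sketched like this (assuming a csv export of the data, named fem_stud_sport.csv):

# Same recipe as for the housing data; only the variable and file names change
fem_stud <- read.csv("fem_stud_sport.csv")
summary(lm(phys_act_h ~ fitness_c, data = fem_stud))      # cf. Stata-output 2.11
summary(lm(phys_act_h ~ exercise_int, data = fem_stud))   # cf. Stata-output 2.12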
Stata-output 2.12 Regression of exercise per week (phys_act_h) on exercise interest
(exercise_int).
possible lines) that minimizes this sum of squared residuals (hence the name
“Ordinary Least Squares”). Wooldridge (2000; see section 2.8) covers the
particulars in this respect.
To get the actual estimate of b1, we apply the formula (where ∑ is a
summation sign)
b1 = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)²,
for which the necessary numbers (i.e. data) are to be found in Table 2.1.
Here, y refers to the price of sausages and x to the sausages’ meat content.
This yields
b1 = 8381.86 / 2357.43 ≈ 3.555.
Table 2.1 Data for the regression presented in Figure 2.6 and the estimates in
Stata-output 2.13.

Obs. | y | x | x − x̄ | y − ȳ | (x − x̄)(y − ȳ) | (x − x̄)²
Once b1 is found, we apply the formula (where x̄ and ȳ denote the means of
x and y)

b0 = ȳ − b1x̄.
Stata-output 2.13 presents the regression and verifies that our calculations are
correct. But relax, you normally don’t need to do these calculations yourself –
that’s what statistical software is for!
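Still, it is instructive to see the formulas at work. A small R sketch (assuming numeric vectors x and y holding the meat contents and prices from Table 2.1):

# Assume x holds the meat contents and y the sausage prices from Table 2.1, e.g.
# x <- sausagedata$meat_content; y <- sausagedata$saus_price (names assumed)

# OLS estimates computed "by hand" from the two formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

# These should match the coefficients reported by lm()
coef(lm(y ~ x))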
Stata-output 2.13 Price of sausages (saus_price) by meat content (meat_content).
y = b0 + b1x + e.
This e is an unobserved variable, and we assume that its mean value is 0. For
this reason, we may disregard e in practice (but see section 5.4). Yet the e
reminds us of the fact that many independent variables might in real life affect
our dependent variable. We could even say that multiple regression, the topic
of Chapter 3, to some extent boils down to extracting independent variables
from e and putting them into the actual regression. In this sense, multiple
regression is among other things about reducing the size of the error term (e)
as compared with bivariate regression.
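One way to see this shrinking of e in R (a sketch, reusing the housedata frame and the dist_city_c variable from the earlier examples):

# Residuals from a bivariate and a multiple regression of sale price
e_biv <- residuals(lm(price_Euro ~ square_m, data = housedata))
e_mult <- residuals(lm(price_Euro ~ square_m + dist_city_c, data = housedata))

# The residual variance cannot increase when an independent variable
# is moved out of e and into the regression itself
var(e_biv)
var(e_mult)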
Further reading
There are many textbooks and research articles on regression. These have
different strengths and weaknesses, depending on the readers’ background,
skills and motivation. My three personal favorites are Allison (1999), Berk
(2004) and Wooldridge (2000 or later editions). Especially the latter covers just
about everything one needs to know about regression analysis.16 Chapter 5 in
Allison (1999), Chapters 2 and 3 in Berk (2004) and Chapter 2 in Wooldridge
are especially relevant for the present chapter.
Other important texts, especially tailor-made for economists but very read-
able to others as well, are Dougherty (2011) and Gujarati (2015).
Wolf and Best (2015) provide a compact yet very thorough introduction to
regression for the social sciences in general. In the chapters to come I will add
more books and papers to this list.
Notes
1 In essence, when doing statistical analysis, we throw away lots of information on
individual units in order to get a clearer picture of the general trends among the
most typical units. That is, we gain more insight by throwing away information (see
Stigler, 2016)!
2 I said in sections 1.2 and 1.5 that this book presupposes no skills in mathematics/
statistics. This still applies, but you should know how to do basic calculations from
a set of numbers. If you do not, here is how: The data (numbers) 7, 8, 8, 8, 9, 9, 9,
9, 10, 10, 10, 11 and 11 are the shoe sizes (a numerical variable) of 13 adult males
(the units). To obtain the mean shoe size for these 13 men, you add the 13 numbers
together and divide this sum by 13. This yields 9.15 after rounding. The typical
spread or variation around this mean is measured by the standard deviation. To
obtain the standard deviation, you first find the variance (s²) using the formula

s² = ∑(yᵢ − ȳ)² / (n − 1),

where ∑ is the summation sign, yᵢ are the individual shoe sizes, ȳ is the mean shoe
size (i.e. 9.15) and n is the number of units (i.e. 13). The standard deviation is then

s = √s²,
which becomes 1.21 after rounding. You should calculate this on your own as an
exercise!
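Or let R do the work (the shoe sizes below are the 13 numbers from this note):

# Mean and standard deviation of the 13 shoe sizes
shoe <- c(7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11)
mean(shoe)   # 9.15 after rounding
sd(shoe)     # 1.21 after rounding; sd() uses n - 1 in the denominator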
3 Other statistics programs have similar commands and/or point-and-click routines.
The point-and-click routines for Stata and SPSS are shown in Appendices A and B
in this chapter.
4 The first interpretation is dynamic; the second is static. Since we are comparing
different houses at one point in time, a static interpretation makes most sense. We
are not examining what happens to sale prices after an increase in house sizes, which
calls for a dynamic interpretation. Normally we restrict dynamic interpretations to
causal relationships; more on this in Chapters 9 and 10 and sections 11.4 and 11.6.
That said, dynamic interpretations are often more elegant in terms of expressing
how x appears to affect y in the data.
5 It should be mentioned that the results of regressions, especially for data with sev-
eral hundreds or thousands of observations, seldom are presented in a graphic
fashion. In this sense the many graphic displays of regression results in this book
serve mostly a pedagogical purpose.
6 That is, at least not in a straight or linear way. Non-linear statistical relationships
between x and y are covered in Chapter 6.
7 This is why Stata calls it “_cons”.
8 The strictly correct equation, which we will return to in section 2.7, is y = b0 + b1x
+ e, where e is the so-called error term. For now, we just assume that e = 0.
9 It is customary to denote 1 as presence of something or yes, and 0 as non-presence
of something or no. But the research question at hand may trump this. Always
make sure variables with two values or categories are coded 1 and 0 in a regression
analysis (and not 1 and 2).
10 Exactly 47% of the houses in the data have been renovated; the descriptive
Stata command “tab renovate” tells you this. In R the similar command is
“table(renovate)”.
11 Since an independent dummy only has two values, it does not yield a graph as nice
as those for numerical independent variables (e.g. Figure 2.1). For this reason, the
results for dummies are seldom presented graphically.
12 Remember: We are comparing different soccer players at one point in time; we
are not analyzing how better or worse performance affects (future) earnings; cf.
note 4.
13 If this book were an introduction to quantitative methods in the social sciences,
we would say that player performance is a theoretical construct and that “average
season performance on a 1 to 10 scale” and “national team appearances” are
operationalizations of this construct. Although the topic of operationalization
is important in quantitative social science, we will not cover it explicitly in
this book.
14 An index is a combination of two or more variables, most typically the sum or
average of these. The exercise interest index is based on two statements (ordinal
variables) with the answer categories apply very poorly (0), apply poorly (1),
apply rather poorly (2), neither poorly nor well (3), apply rather well (4), apply well
(5) and apply very well (6). The first statement reads “Doing physical exercise is
important for me” and the second reads “I am not interested in doing physical
exercise”. The latter statement’s answer categories were reversed (6 = 0, 5 = 1,
4 = 2, 3 = 3, 2 = 4, 1 = 5 and 0 = 6) before the two variables were summed and
divided by two to keep the original scale from 0 to 6. A female student scoring
6 on the index (maximum interest) thus means that she answered apply very well
to the first statement and apply very poorly to the second, which in turn yields
12/2 = 6. This index is an operationalization of the theoretical construct exer-
cise interest (cf. note 13). For practical purposes we treat this index and similar
ones as numerical variables in the social sciences, although this is not strictly
correct.
15 But if you are to make predictions from your regression analysis (cf. section 2.3),
you should always make sure to include the constant. Otherwise your predictions
will be wrong!
16 The first edition of Wooldridge’s textbook came out in the year 2000, and sev-
eral editions have been published after this. Technically this book is on econo-
metrics, which is economists’ umbrella name for applying statistics/mathematics
to economic data. In practice, however, this often boils down to doing regression
analysis.
Stata-output 2.15 The box appearing after following the procedure in Stata-output 2.14.
Then, open/click on the blank space for “Variables:” (see arrow) and click/
select the variables square_m and price_Euro to produce Stata-output
2.16. Then click on OK and Stata-output 2.17 appears.
Stata-output 2.17 Results of the procedure from Stata-output 2.14 to 2.16.
Stata-output 2.19 The box appearing after following the procedure in Stata-output 2.18.
Stata-output 2.21 Results of the procedure from Stata-output 2.18 to 2.20.
SPSS-output 2.2 The box appearing after following the procedure in SPSS-output 2.1.
Then, click on/select square_m and price_Euro to get them into the
(active) window on the right-hand side (→), like in SPSS-output 2.3. Then
click on OK and SPSS-output 2.4 appears.
SPSS-output 2.4 Results of the procedure from SPSS-output 2.1 to 2.3.
SPSS-output 2.6 The box appearing after following the procedure in SPSS-output 2.5.
SPSS-output 2.8 Results of the procedure from SPSS-output 2.5 to 2.7.
3
Linear regression with several
independent variables
Multiple regression
. use "C:\Users\700245\Desktop\temp_myd\housedata.dta"
------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms | 61456.68 12880.64 4.77 0.000 35895.47 87017.89
_cons | 233101 45739.87 5.10 0.000 142331.8 323870.3
------------------------------------------------------------------------------
The regression coefficient for bedrooms suggests that a house with three
bedrooms on average costs about 61 500 € more than a house with two
bedrooms. This is much as anticipated. In other words, our regression gives
rise to the visual model depicted in Figure 3.1.
Let’s for a second examine what happens if we also include the independent
variable square meter (square_m) in the regression, leaving the further tech-
nicalities of this aside for the moment. Stata-output 3.2 presents the results.
Note that the Stata command simply adds “square_m” to the previous
regression command in Stata-output 3.1.
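The same pair of regressions may be sketched in R along these lines (assuming the housedata frame from Chapter 2, which also contains the bedrooms variable):

# Bivariate regression: sale price on number of bedrooms (cf. Stata-output 3.1)
summary(lm(price_Euro ~ bedrooms, data = housedata))

# Multiple regression: square_m added as a second independent variable
# (cf. Stata-output 3.2)
summary(lm(price_Euro ~ bedrooms + square_m, data = housedata))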
Figure 3.1 Visual model describing how number of bedrooms affects house sale price
positively, based on the regression in Stata-output 3.1.
------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms | 3897.93 12107.64 0.32 0.748 -20132.38 27928.24
square_m | 2572.144 310.7542 8.28 0.000 1955.382 3188.905
_cons | 67913.05 40460.62 1.68 0.096 -12390.07 148216.2
------------------------------------------------------------------------------
Figure 3.2 Visual model showing how the variable square meter affects number of
bedrooms and sale price positively, based on the regressions in Stata-
outputs 3.1 and 3.2.
Stata-output 3.3 Regression of hours of exercise per week (phys_act_h) on fitness
center membership (fitness_c) (upper part) and regression of
hours of exercise per week (phys_act_h) on fitness center member-
ship (fitness_c) and exercise interest (exercise_int) (lower part).
. use "C:\Users\700245\Desktop\temp_myd\fem_stud_sport.dta"
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fitness_c | 2.225656 .3694298 6.02 0.000 1.498563 2.952749
_cons | 2.08982 .2422609 8.63 0.000 1.613015 2.566626
------------------------------------------------------------------------------
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fitness_c | .9233283 .3562325 2.59 0.010 .2221994 1.624457
exercise_int | 1.232009 .1352184 9.11 0.000 .9658751 1.498143
_cons | -2.934121 .5914624 -4.96 0.000 -4.098224 -1.770017
------------------------------------------------------------------------------
There are several ways of expressing the findings of this multiple regres-
sion: We may say that fitness center members exercise almost one hour (0.92)
more than non-members, controlling for exercise interest.3 Or we could say the
same, but finish off with holding exercise interest constant. A third way is to
use the ceteris paribus (i.e. all-else-equal) clause, meaning that all other inde-
pendent variables included in the regression are being held constant or being
controlled for. In the general multiple regression case of y on x1, x2, x3 and
so on, the regression coefficients for x1, x2, x3 and so on are always controlled
for each other in this way. But when we said above that the multiple regres-
sion compared fitness center members and non-members equally interested in
exercising, this should be taken a bit metaphorically. In more precise math-
ematical terms, which we will not dive into (see Fox, 1997 or Wooldridge,
2000), multiple regression partitions or isolates the effects of the independent
variables into the unique contributions from each independent variable in the
regression (see also section 3.5).
In this example we controlled the fitness center membership dummy of
key interest (x1) for exercise interest (x2). In practice we typically control
for several independent variables simultaneously (and not only x2), making
comparisons between even more homogeneous groups: fitness center members
and non-members who are equal with respect to, say, age, dietary habits, body
weight, fitness motivation etc. The umbrella term for such variables is rele-
vant control variables, i.e. variables we (a) believe affect y and (b) believe to be
correlated with the independent variable of key interest, x1. But even though
using several controls makes multiple regression messier in mathematical
terms, it is by no means more complex to carry out in statistics programs such
as Stata, R, SPSS and the like.
Now for an important thought experiment. Imagine adding two more con-
trol variables, say age and fitness motivation, to the multiple regression at the
bottom of Stata-output 3.3. (Both are likely to affect exercise hours and to be
correlated with the membership dummy.) Two things might then happen to the
regression coefficient (effect) of fitness center membership: (1) It might remain
at around 0.9 hours. If so, we may add these controls to the ceteris paribus
clause and become more certain that fitness center membership has a unique
effect on hours of exercising, as the journalist claims. (2) The regression coeffi-
cient (effect) might get much smaller and in the extreme case become zero.4 If
so, we can no longer claim that fitness center membership has a unique effect
on exercise hours. The journalist’s claim would, in short, be false. Implication
1: If we want to be certain our independent variable of key interest (here: fitness
center membership) has a larger-than-zero effect on the dependent variable in
question (here: hours of exercise), we must always make every possible effort
to include all relevant control variables in our multiple regression.5 Implication
2: It is quite hard to actually become dead sure that our data comprise all rele-
vant control variables in regression-situations like this…
Formally, the multiple regression equation may be written as

y = b0 + b1x1 + b2x2 + … + bkxk + e.
budget etc.) Before concluding that innovation has this or that unique
effect on revenues based on a bivariate regression, these other inde-
pendent variables should be controlled for in a multiple regression.
• Whether classes with fewer than 20 pupils really outperform classes with
more than 20 pupils in terms of class-averaged grades
Comment: The two groups of classes may also differ with respect to
other independent variables than number of pupils influencing grades
(e.g. pupil skills, pupil motivation, school vicinity affluence etc.) Before
concluding that number of pupils has this or that unique effect on grades
based on a bivariate regression, these other independent variables should
be controlled for in a multiple regression.
• Whether women aged between 40 and 50 years and exercising for three
hours or more per week really report better self-assessed health than
similar women exercising for less than three hours per week
Comment: The two groups of women may also differ with respect to
other independent variables than exercise habits affecting health (e.g.
smoking prevalence, dietary habits, educational level etc.) Before con-
cluding that exercise has this or that unique effect on health based on
a bivariate regression, these other independent variables should be con-
trolled for in a multiple regression.
• Whether better-performing players in the Premier League really receive
more pay than lower-performing players
Comment: The two groups of players may also differ with respect to
other independent variables than performance affecting pay (e.g. age,
player position, number of matches played, goals scored in career, player
nationality, number of assists etc.) Before concluding that performance
has this or that unique effect on pay based on a bivariate regression,
these other independent variables should be controlled for in a multiple
regression.7
categories are ordered, but this is unimportant. Note that most students do
not smoke (69%), whereas 16% smoke sometimes and 15% smoke on a daily
basis. (The data are from some years back; fewer students smoke now.) In the
data never is assigned the value of 0, whereas sometimes is assigned 1 and daily
is assigned 2.
. use "C:\Users\700245\Desktop\temp_myd\all_stud_sport.dta"
. tab smoke
How may we examine if and how this smoking status variable is associated
with exercise hours? We make two new dummy variables from the three levels
of the smoking status variable, and then use these two dummies in a multiple
regression. (This is the general way to proceed; a faster way within Stata will
be described shortly.)
The upper part of Stata-output 3.5 first creates the dummy some_smoke,
indicating whether a student smokes sometimes (1) or not (0). The two
commands are “gen some_smoke = smoke” and “recode some_smoke
1=1 else=0”. Next, the dummy daily_smoke is created in the
same manner, save for the small difference in the recoding procedure. Note
that we did not (have to) make a dummy for the never category. At the bottom
of the output, we have the multiple regression results. The constant (_cons)
refers to the dummy we did not make, i.e. the mean of exercise hours for those
never smoking (i.e. about 4 hours per week).8
Interpretation: Those smoking sometimes (some_smoke = 1) exercise for
1.33 fewer hours per week than those who never smoke (the constant); those
smoking daily (daily_smoke = 1) exercise for 1.62 fewer hours per week
than those who never smoke (the constant). The difference in exercise hours
between those smoking daily and those smoking sometimes is therefore about
0.29 hours (1.62 − 1.33 = 0.29). Smoking appears to be associated with exercise
hours, although the difference between two groups of smokers is of a small
magnitude.
Stata-output 3.5 Regression of hours of exercise per week (phys_act_h) on smoking
sometimes (some_smoke) and smoking daily (daily_smoke) and
commands for creating these two latter dummy variables.
R-output 3.2 R command-file generating the statistical output of Stata-output 3.5.
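A sketch of what that command file may contain (the csv file name is an assumption; the coding of smoke as 0, 1 and 2 is from the text above):

# Read the student data and create the two smoking dummies
all_stud <- read.csv("all_stud_sport.csv")
all_stud$some_smoke <- as.numeric(all_stud$smoke == 1)    # smokes sometimes
all_stud$daily_smoke <- as.numeric(all_stud$smoke == 2)   # smokes daily

# Multiple regression; the constant then refers to the omitted
# reference category: the never-smokers
summary(lm(phys_act_h ~ some_smoke + daily_smoke, data = all_stud))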
Stata has a faster and less error-prone way of doing the regression in Stata-
output 3.5. This is shown in Stata-output 3.6, where the command prefix “i.”
by default tells Stata to use the lowest value of the independent variable of
interest (here: never smoke; coded 0) as the reference in a multiple regression.
(This may be altered; see the help menu.)
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoke |
sometimes | -1.330681 .4931528 -2.70 0.007 -2.300043 -.3613191
daily | -1.620885 .5121421 -3.16 0.002 -2.627573 -.6141969
|
_cons | 4.04024 .2143309 18.85 0.000 3.618942 4.461537
------------------------------------------------------------------------------
Many real-life multiple regressions include both sets of dummies (i.e. cat-
egorical variables) and numerical variables at the same time. This does not
complicate matters; the only thing to remember is the ceteris paribus or all-
else-equal interpretation of the coefficients in the regression.
where membership is coded 1 and non-membership is coded 0. The third inde-
pendent variable is the familiar smoking variable (smoke) from section 3.4.
The fourth independent variable, an ordinal variable, reads “The monetary
costs involved in exercising put restrictions on my participation” (applies very
poorly = 0, applies poorly = 1, neither/nor = 2, applies well = 3 and applies
very well = 4). The mean of this 0–4 scale, labelled cost in the data, is 1.53,
where larger values imply more severe subjective cost restrictions. Finally, the
age of the student (age) is the last independent variable (mean = 22.34 years).
The multiple regression appears in Stata-output 3.7. (Since there is nothing
new in terms of coding in R compared with the previous R-examples, the
similar R-commands and output are dropped.)
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -1.422866 .3653636 -3.89 0.000 -2.14106 -.7046719
fitness_c | 1.555477 .3485678 4.46 0.000 .8702987 2.240656
|
smoke |
sometimes | -1.292541 .4659708 -2.77 0.006 -2.208498 -.3765833
daily | -.9184787 .4975576 -1.85 0.066 -1.896526 .0595687
|
cost | -.4181837 .1189421 -3.52 0.000 -.6519878 -.1843795
age | -.034727 .0387085 -0.90 0.370 -.1108163 .0413622
_cons | 5.670714 .9582337 5.92 0.000 3.787117 7.55431
------------------------------------------------------------------------------
Adjusting for controls, female students exercise for 1.42 fewer hours than
male students on a weekly basis. That is, female students “similar” to male
ones with respect to fitness center membership, smoking status, subjective
cost restrictions and age exercise for almost 1.5 fewer hours per week than
their male counterparts. The constant of 5.67 has no real-life meaning, since
none of the students have the value of zero on all independent variables in
the regression. Analogously (i.e. all-else-equal), fitness center members exer-
cise for 1.56 more hours than non-members. Both groups of smokers exer-
cise fewer hours than non-smokers; the ceteris paribus differences being 1.29
hours between non-smokers and sometimes-smokers and 0.92 hours between
non-smokers and daily-smokers. The cost variable has a regression coefficient
of about –0.42. Interpreted dynamically it suggests that a one-unit increase
in this variable, say from applies well (3) to applies very well (4), is associated
with a 0.42-hour decrease in exercise hours. More statically and correctly, we
have that a student answering “applies very well” (4) on average exercises for
1.68 fewer hours per week than a student answering “applies very poorly”
(0) (4 × –0.42 = –1.68). That is, those facing financial constraints on exer-
cising exercise less than those who do not. Age has a tiny negative effect
(–0.03), suggesting that older students exercise slightly less per week than
younger ones. But since this coefficient is close to zero (implying no effect
of age on hours of exercise), we pay no attention to it and conclude that age
does not appear to be associated with hours of exercise. We will later learn,
in Chapter 4, why this is justified.
A regression model’s explanatory power – the R2. So far we have been
interested in the individual coefficients (effects) of each independent vari-
able in the regression. We now turn attention to the regression model per se.
More specifically, we would like to know how “good” our regression model
is in terms of explanatory power. In regression-speak this boils down to
how much of the variation (strictly speaking, variance) in y the x-variables
combined account for. But first, what do we mean when we say regression
model? Simply put, the regression model in Stata-output 3.7 may be written
in algebraic form as

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6,
where y = hours of exercise per week, b0 is the constant, the different bs are
the various regression coefficients (effects) to be estimated, and
x1 = gender,
x2 = fitness center membership,
x3 = smoking sometimes,
x4 = smoking daily,
x5 = monetary cost, and
x6 = age.
Visually this regression may be depicted as shown in Figure 3.3. Whether
expressed in algebraic or visual form, our regression model suggests that (the
amount of) exercise hours is an additive function of six independent variables.
The “goodness” of this model is expressed in terms of how much of the vari-
ation in hours of exercise the six independent variables explain altogether.
The so-called R2 is the measure of interest in this regard.
To exemplify how the R2 is calculated, take a look at the SS column in
Stata-output 3.7. The term SS is short for sum of squares, which in simplified
terms we may think of as a measure of the variation in the data. In the third
row of this column (Total) we have the total sum of squares (TSS); that is
5 817.26. TSS tells us how much variation (variance) there is in y altogether in
the data. In the first row (Model) we have the model sum of squares (MSS);
that is 878.50. MSS denotes how much of the variation in y is attributable to
the independent variables in the model. If we divide MSS by TSS we get the R2.
Figure 3.3 Visual model of hours of exercise as an additive function of gender, fitness
center membership, smoking sometimes, smoking daily, monetary cost and age.
In our model the R2 is 0.151 or 15.1% (878.50/5 817.26 = 0.151). The six
independent variables in the regression model explain about 15% of the vari-
ation (variance) in the dependent variable. In other words, 85% of the vari-
ation in hours of exercise stems from independent variables not included in
the regression model and/or randomness. We find the R2 on the right-hand
side of Stata-output 3.7, denoted R-squared. Every statistics program
doing regression reports this measure. In the program R, the R2 is reported at
the bottom of the output as Multiple R-squared (see e.g. R-output 3.2).10
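As a sketch of the arithmetic in R (assuming the all_stud frame from earlier, with the variables from Stata-output 3.7):

# The regression model of Stata-output 3.7; factor(smoke) mirrors Stata's i.smoke
m <- lm(phys_act_h ~ gender + fitness_c + factor(smoke) + cost + age,
        data = all_stud)

# R-squared "by hand": the model's share of the total sum of squares
tss <- sum((all_stud$phys_act_h - mean(all_stud$phys_act_h))^2)  # total SS
rss <- sum(residuals(m)^2)                                       # residual SS
(tss - rss) / tss          # model SS divided by total SS
summary(m)$r.squared       # the same figure as reported by lm()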
Students understandably want to know if there is a threshold value for R2
suggestive of a “good” model. Unfortunately, there is no such thing. Whether
our model of exercise hours is acceptable or not must be judged against a yard-
stick. If some researchers have earlier presented a similar model of exercise
hours explaining 25% of the variation, we might have a problem in the sense
that our model may lack important (control) variables. If these researchers’
R2 was only 10%, in contrast, our model looks better on the face of it.11 That
said, few would argue against the idea that a model explaining much of the
variation in y is preferable to a model explaining very little. Finally, it should
be noted that analyses of questionnaire data very seldom yield an R2 larger
than, say, 50%, whereas R2 scores between 10 and 20% are very common
indeed.12
A new case: The price of cognac. Below we examine how the price of cognac is
associated with certain key attributes of this liquor. To address this issue, we
use a data set on 52 bottles of cognac (cognac_price.dta). Descriptive
statistics for the variables in the data appear in Stata-output 3.8.
Stata-output 3.8 Descriptive statistics for the variables price, quality and type.
. tab type
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
quality | 29.25561 4.6775 6.25 0.000 19.86058 38.65065
_cons | -2014.343 386.0812 -5.22 0.000 -2789.81 -1238.876
------------------------------------------------------------------------------
Stata-output 3.10 scrutinizes the relationship between cognac type and
price. Note that VSOP costs about NOK 55 more than VS (the reference) on
average, and that XO similarly costs about NOK 139 more than VS. The price
of VS is approximately NOK 330, as indicated by the constant. (Take a look
at section 3.4 again if these interpretations are unfamiliar.) The variable type
accounts for 55% of the variation in price.
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
type |
VSOP | 54.7265 20.60492 2.66 0.011 13.31936 96.13363
XO | 139.254 18.18379 7.66 0.000 102.7123 175.7957
|
_cons | 329.8889 13.34325 24.72 0.000 303.0746 356.7032
------------------------------------------------------------------------------
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
quality | 13.54774 5.363688 2.53 0.015 2.763325 24.33216
|
type |
VSOP | 42.39458 20.15931 2.10 0.041 1.861563 82.92759
XO | 101.5138 22.82966 4.45 0.000 55.61172 147.4159
|
_cons | -769.7362 435.5369 -1.77 0.084 -1645.442 105.9694
------------------------------------------------------------------------------
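In R, the same regressions could be sketched as follows (assuming a csv export of the cognac data and that type is stored as text with VS, VSOP and XO as its values):

# VS is listed first and thus becomes the reference category
cognac <- read.csv("cognac_price.csv")
cognac$type <- factor(cognac$type, levels = c("VS", "VSOP", "XO"))

summary(lm(price ~ type, data = cognac))             # cf. Stata-output 3.10
summary(lm(price ~ quality + type, data = cognac))   # quality and type control
                                                     # for each other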
Figure 3.4 The overlap between the two lower circles (in black shading) denotes the
correlation between cognac quality (x1) and cognac type (x2). The non-
overlapping parts of these circles (in gray shading) express the regression
coefficients/bs (effects) that are unique to cognac quality (x1) and cognac
type (x2) in explaining variation in price (y); see also section 3.2.
Notes
1 I also discuss experimental data in Chapters 9 and 10.
2 More on this problem, i.e. the problem of obtaining so-called causal effects, in
Chapters 9 and 10 and sections 11.4 and 11.6.
3 The principle simultaneously also works the other way around: Fitness center
membership is controlled for exercise interest, and exercise interest is controlled
for fitness center membership.
4 In some cases the regression coefficient of interest might increase in size when
controls are added to the model. This is called a suppressor effect; see DeMaris
(2004) for more on this.
5 That is, “relevant” meaning (control) variables affecting y and being correlated
with x1. More on this in section 5.2, i.e. the regression assumption that all relevant
independent variables should be included in the regression model (i.e. no omitted
independent variables).
6 Twenty observations per independent variable in the model should be adequate in
most cases (see Allison, 1999).
7 The group comparison approach is chosen for ease of exposition. The inde-
pendent variables of main interest do not have to be dummies; they could instead
be numerical or ordinal independent variables.
8 Remember that the constant is where the regression line crosses the y-
axis, which is for x = 0. If a student has the value of x = 0 for some_smoke (i.e. does
not smoke sometimes) and the value of x = 0 for daily_smoke (i.e. does not smoke
daily), he or she must be a non-smoker (i.e. answer never) since there are no other
smoking statuses available in the questionnaire/data.
9 In Chapters 7 and 8 I consider regression analysis for non-numerical dependent
variables.
10 More strictly, the sum of squares in question refers to the size of the error
term (e) of the regression, which becomes smaller when independent variables are
added to the model (cf. section 2.7). The so-called adjusted R2 takes into account
that a regression with many independent variables by random chance is more
likely to produce a larger R2 than a regression with fewer independent variables.
The adjusted R2 takes away this advantage.
11 If the research question is confined to finding x1’s unique effect on y, which often
is the case, the role and size of R2 is typically of minor importance in practical
research.
12 The type of data analyzed also matters for the size of R2. Cross-sectional data typ-
ically yield lower R2 values than time-series data.
Exercise 3A
Use the housedata.dta. A dummy in this data set is called “sec_living”
and denotes whether or not a house has a second living room (yes = 1, no = 0).
3A1: Estimate the following regression model:
price_Euro = sec_living

3A2: Estimate the following regression model:

price_Euro = sec_living + renovate

3A3: Estimate the following regression model:

price_Euro = sec_living + renovate + square_m
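In R, the three models could be sketched as (assuming the housedata frame from the earlier examples):

summary(lm(price_Euro ~ sec_living, data = housedata))                        # 3A1
summary(lm(price_Euro ~ sec_living + renovate, data = housedata))             # 3A2
summary(lm(price_Euro ~ sec_living + renovate + square_m, data = housedata))  # 3A3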
Exercise 3A, solution
3A1:
The regression coefficient for sec_living is 107 070, suggesting that
houses with a second living room sell for 107 000 € more on average than
houses without such a room. Houses without a second living room sell for, on
average, 421 254 € (i.e. the constant). The R2 is about 0.09, indicating that the
dummy in question accounts for 9% of the variation (variance) in sale prices.
3A2:
The regression coefficient for sec_living is 104 267, suggesting that houses
with a second living room sell for 104 000 € more than houses without such a
room, controlling for renovate. Similarly, renovated houses sell for 74 756 €
more than non-renovated ones, controlling for sec_living. Since both inde-
pendent variables are dummies (i.e. have the same measurement scale), the results
suggest that a second living room is slightly more important for price variation
than renovation. (The fact that the R2 does not double in size from the bivariate
model is also suggestive of this.) The mean price of a non-renovated house with
no second living room is 386 679 € (i.e. the constant).
3A3:
The regression coefficient for sec_living is now 18 225, i.e. minuscule
compared with the models in A1/A2. This suggests a more or less spurious rela-
tionship between sec_living and price, caused by the variable square_m
as explained in section 3.1. The same logic applies for the variable renovate,
but to a smaller extent; the b is now 38 900. The main driver behind variation
in house prices is, as expected, house size (i.e. square_m). The increase in the
R2 to 0.54 (or 54%) also underscores the latter point.
Exercise 3B
Use the data football_earnings.dta. The variable games_int refers
to total number of international matches played, and the variable age is the
age of the player in question. Both variables are expected to affect yearly
earnings in € (earnings_E).
3B1: Estimate the following regression model:
earnings_E = games_int
3B2: Estimate the following regression model:
earnings_E = games_int + age
Exercise 3B, solution
3B1:
The regression coefficient for games_int is 3 044, indicating that a player
with 20 international matches on average earns about 30 000 € more per year
than a player with 10 international matches (10 × 3 044). The R2 of 0.14
suggests that number of international matches explains 14% of the variation
(variance) in yearly earnings.
3B2:
The regression coefficient for games_int is 2 557, controlling for player
age (age), indicating that a player with 20 international matches earns about
26 000 € more per year than a player with 10 international matches (10 ×
2 557). 16% of the association between number of international matches and earnings is therefore spurious and caused by player age ((3 044 − 2 557)/3 044 ≈ 0.16). The regression coefficient for age is 3 794, implying that a 30-year-old
player earns about 19 000 € more per year than a 25-year-old ceteris paribus
(5 × 3 794). But does it make sense that earnings keep on increasing as the
players age indefinitely? More on this in sections 6.3, 6.4 and 6.5. Since the R2 is only 0.18 compared with 0.14 in B1, the regression model suggests that games_int is more important for the variation in earnings than age.
Confounder:
An independent variable, z, that affects both x1 and y in such a way that the x1–y relationship is (more or less) false/spurious.
Data:
Collective term for the units/observations (see below) for which we have vari-
able information.
Dependent variable (or y):
The variable we want to explain statistical variation in, based on (variation in)
the independent variable(s).
Dummy variable (dummy):
A variable with only two distinct values, coded 0 and 1.
Error term (e):
An unobserved variable (along with randomness) summing up all effects
on the dependent variable other than those stemming from the inde-
pendent variables in the regression model. Generally speaking, we assume the mean of e to be zero.
Independent variable (or x):
The variable we think is (partly) responsible for the variation in the dependent
variable.
Index:
A combination of several variables into one composite variable.
Intercept:
See constant.
Multiple regression (analysis):
A regression analysis containing two or more independent variables.
Observations:
See units.
Regression (analysis):
A statistical method associating an independent variable, x, to a dependent
variable, y. The aim is to quantify if and to what extent x is associated with y.
Note that a regression might include several independent variables simultan-
eously; see Multiple regression above.
Regression coefficient:
A figure describing to what extent x is associated with y in the data, i.e. the
change in y associated with a one-unit increase in x. Also called b1 or slope.
Residual:
The vertical distance between a data point and the regression line in the data.
Respondents:
Persons who have answered a questionnaire, and as such often the units (see
below) up for analysis in a regression.
R2:
A figure denoting how much of the variation (variance) in the dependent variable is brought about by all the independent variables in the regression combined.
Slope:
See regression coefficient.
Subjects:
Persons as in units (see below) up for analysis in a regression, but not neces-
sarily respondents answering a questionnaire.
Units:
The entities we have variable information (or data) on; the input for calcu-
lating b0 and b1 (see above). Often persons or firms in the social sciences. Also
called observations in many cases.
Variable:
Something that varies among units/observations (or with time).
x:
See independent variable.
y:
See dependent variable.
Part 2
The foundations
4
Samples and populations, random
variability and testing of statistical
significance
4.1 Introduction
You have probably heard of the terms “sample” and “population” before.
You might also have a good grasp of what these concepts mean. If not, don’t
worry, as they will be explained below. Up until now I have used the word
“data” exclusively to emphasize that our regression coefficients, or bs, refer to
the data at hand. That is, no reference has been made to what these coefficients
might be “outside” of the data. This chapter is about making such inferences.
There are, to simplify a bit, two kinds of samples: representative samples
and non-representative samples. The former will concern us in this chapter
(but see section 4.6). Representative samples are, as the name implies, intended
to be representative of some larger population. The traits of interest in this
population are most often unknown to us, and we thus use a representative
sample to tell us about these traits. The aim of this introduction is to obtain a
rudimentary first grasp of this process.
You have likely answered a political poll or survey on consumer habits or
some related issues. Often 1 000 people (respondents) or more answer such
surveys, and they are referred to as “the sample”. Yet when researchers pre-
sent results of such surveys, they make claims regarding “a population” pos-
sibly containing millions of people. In other words, they find out stuff about
a small sample but draw conclusions about a large population. How come?
The answer is that before the researchers collected their sample, they took
precautions to ensure that it would end up as representative of the larger
population. If they succeeded, which they most often do, they are entitled to claim knowledge of a whole population even though they have only surveyed a small
fraction of this population. This is ingenious in terms of effort and costs!
In a regression we find the b1 for the relationship between x and y in the
data, which we now call a sample.1 (We omit control variables for simplicity.) If
this sample is representative of a population, and if the b1 of interest is clearly different from zero (in either direction), we may in general extrapolate this sample-specific b1 to
the population and claim that it is large enough to warrant attention from a
scholarly or practical point of view. That is, we claim x1 to be associated with
y in the population although we only have examined data from a small sample
of this population. Why we are able to do so is what Chapter 4 explains in a
rough and step-by-step fashion.
Figure 4.1 Histogram of variable square meter (sq_m), with normal curve imposed (mean = 64.92).
Figure 4.2 The normal curve, with z-scores on the x-axis and the 68% and 95% regions marked.
The standard error (SE) of the sample mean is given by

SE = s / √n,
where s is the standard deviation for the variable of interest, and n is the
number of observations in the sample. For the variable square meter (sq_m)
in the apartment data, these figures are 23.52 and 325 (see section 4.2). The
standard error (SE) thus becomes SE = 23.52 / √325 ≈ 1.305.
Now we have what we need to compute the 95% confidence interval for the mean of square meter (sq_m) in the population of interest. The formula is mean ± 2 × SE, which here gives 64.92 ± 2 × 1.305: the mean of square meter in the population of apartments lies in the region between 62.31 and 67.53 m2 with 95% confidence.
Main point: This example shows how to make a rather precise description of
a population variable (here: a mean) based on only the sample equivalent of
this description. But recall that this requires that the sample in question is rep-
resentative of the population which, in turn, assumes that the sample has been
drawn according to some kind of random procedure (see note 2).
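In Stata, this confidence interval may also be obtained directly (a sketch; the ci means syntax is for Stata 14 and later, older versions use ci sq_m):

. ci means sq_m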
The 95% confidence interval for a regression coefficient. By the same logic as in
getting from an exact sample mean to a more imprecise population mean, we
do the equivalent for a sample regression coefficient. The key is first to find the
standard error (SE) of this regression coefficient. We use the apartment data
to illustrate; Stata-output 4.1 reports the regression between the variables
square meter (sq_m) and sale price in euros (price_e).
Stata-output 4.1 Regression of apartment sale prices (price_e) on apartment sizes
(sq_m).
. use "C:\Users\700245\Desktop\temp_myd\price_sq_m.dta"
------------------------------------------------------------------------------
price_e | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sq_m | 3402.354 293.6466 11.59 0.000 2824.652 3980.055
_cons | 91461.22 20273.53 4.51 0.000 51576.37 131346.1
------------------------------------------------------------------------------
The output contains the information we need to compute the 95% confidence interval for the regression coefficient in the population, since the formula is b ± 2 × SEb. This gives 3 402.354 ± 2 × 293.647: The 95% confidence interval for the square meter coefficient in the population of apartments is roughly between 2 800 and 4 000 €.6 In the R program
the SE is reported on the right-hand side of the regression coefficient; see e.g.
R-output 2.1.
The region of acceptance for H0 is from about –600 to 600. Since our sample regression coefficient is 3 402 – i.e. clearly outside this region – we reject H0 and gain indirect support for the H1 that square meter has a statistical association/relationship with sale price.
I mentioned that the term “unlikely” needs a precise definition. In statistics,
something is unlikely if it happens 5 times or less out of 100 in repeated sam-
pling. Conversely, it is likely to happen if it turns up more than 95 times out
of 100 in repeated sampling. Informally, we say that something is unlikely if it
has a 5% probability or less of occurring. In this sense, getting a sample regression coefficient of 3 402, as we did above, is unlikely if the true coefficient in the population is zero. Hence, we reject what is unlikely, i.e. H0, and get indirect
support for H1. In practice, thankfully, testing of hypotheses in statistical
research is less cumbersome than the above procedure. Without further ado,
please welcome significance testing!
Significance testing and t-values. Above we claimed to have gotten support,
albeit indirect, for our H1 –that there was an association between square
meter and sale price in the population. We could instead have said that our
regression coefficient is “statistically significant”, which means the same thing.
Indeed, this is the typical way of saying that a regression coefficient is “large
enough” to be present in the (unknown) population and perhaps to deserve
attention in a scholarly project.
In practice we divide the sample regression coefficient, b, by its standard
error (SEb) to find out if b is statistically significant. From Stata-output 4.1
this yields 3 402.354/293.647 ≈ 11.59. The result is called a t-value, and all
statistics programs calculate it. Stata reports it under the heading “t”; in R it
is labelled “t value”. If the t-value is larger than ± 2, implying that b lies
outside of the region of acceptance for H0, we deem b to be “statistically sig-
nificant”.9 Since the present t-value is 11.59, our b is statistically significant.
That is, we judge b to be larger than zero in the population in the same way as
we did when we calculated the region of acceptance for H0 and found that b
was on the outside of this region.
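The division itself requires no statistics program; in Stata it could be checked with the display command, which returns roughly 11.59:

. display 3402.354/293.6466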
Sometimes the t-value of b becomes smaller than 2, indicating that b lies
within the acceptance region of H0. If so, we dare not reject H0, and we thus
keep it. We then normally conclude that the regression coefficient is not “large
enough” to warrant attention in the population.10 Or, as shorthand, we say
that b is “not statistically significant”. In Stata-output 3.7 we saw that age
had a very tiny effect on hours of exercise; the b being –0.03 and the t-value
being –0.90. Since this t-value is smaller than 1.96 in absolute value, we conclude that the effect of age is not significant (or insignificant).
Significance levels and t-values. So far the presentation has implied that we
discard H0 if it has a 5% or smaller probability of occurring in repeated sam-
pling. In large samples (i.e. n > 120) this tallies with a t-value of 2. Or 1.96
to be exact. That is, a t-value of 1.96 parallels a 5% significance level for b
in large samples. This 5% level is the default in the social sciences. Another
threshold is the 1% significance level. This corresponds to a t-value of 2.6
in large samples. (The relationship between t-values and significance levels is
inverse: The larger the t-value, the lower the significance level. The larger the
t-value, the less likely it is that H0 occurs by chance in repeated sampling.)
What, then, is a t-value? For large samples t-values are similar to the z-values
we saw in Figure 4.2. These t-values, to simplify, measure the distances in the
region of acceptance for H0 as depicted in Figure 4.3. The figure shows that
our t-value of 11.59 is outside the 0 ± 2 t region, which is the 5% significance
level. It is also outside the 0 ± 2.6 t region for the 1% significance level. In other
words, our regression coefficient/b is also statistically significant at the 1% level.
In small samples, t-values must be larger than 1.96 to be deemed significant at the 5% level. For, say, an x-y regression on 40 observations, the t-value must be at least 2.021 in order to reach the 5% level. Every book on statistics has a table
showing these t-value/significance level associations for various sample sizes.
Significance probability or p. On the right-hand side of “t” in Stata-output
4.1 we see the heading “P>|t|”. This column denotes the exact significance
Figure 4.3 Region of acceptance for the null hypothesis and its relationship with
t-values.
for the standard error in section 4.4.), it follows that many tiny regression
coefficients are statistically significant in large samples (n > 1 000). From
this it also follows that we cannot use differences in significance levels to dis-
criminate among independent variables in terms of their relative practical
importance.
The null hypothesis is often stupid! In our example, H0 states that the regression coefficient for square meter is zero. Given our knowledge of how the housing market works this appears, well, at least questionable… To get around this, Stata has the capability to assess an alternative H0 – for example, that the population coefficient is 2 500 rather than zero. After the command yielding Stata-output 4.1, we simply write “test sq_m = 2500” in the command window
and press Enter. This yields:
. test sq_m = 2500
( 1) sq_m = 2500
F( 1, 323) = 9.44
Prob > F = 0.0023
In other words, against the null that the b for square meter is 2 500, the
reported b is significant at the 0.23% level. My limited knowledge of R
prevents me from doing this analysis in that program, but I suspect it is pos-
sible. See Wickham and Grolemund (2016) for an introduction to statistical
analysis in R.
The 5% significance level is arbitrary. A whole issue of The American Statistician (2019, Vol. 73, Supplement 1) was recently devoted to the misuse of significance
testing in much current practice. At the center stage of the critical remarks
was the tendency of viewing research findings as either significant/important
(p < 0.05) or insignificant/non-important (p > 0.05). Such binary thinking
should be abandoned! Instead, I encourage students (with the support of a
reviewer of this book) to take a continuous, as opposed to binary, perspective
on significance testing.
Significance testing on non-representative samples and populations. Recall the
introduction to this chapter: A sample generated by some random mechanism
lies at the core of significance testing as it is presented here. But many of the data
used by social scientists/economists today are not samples in this sense, such as
convenience samples and populations. May we still apply significance testing in
the customary manner? Some are skeptical (Berk, 2004; Freedman et al., 1998),
and some are not (Hoem, 2008; Lumley, 2010). The prevailing practice seems to
be to treat the data, no matter how they were generated, as a miniature of some
imaginary population and, with this justification, carry on doing significance
testing in the traditional manner. For more on this, see Sterba (2009).
One-sided and two-sided tests of significance. The presentation so far has
referred to what is known as a two-sided significance test, i.e. the default in
Notes
1 Many types of data on which regression is used are not (random) samples in the
traditional sense, such as the soccer player data analyzed in section 2.4 and the
sausage data analyzed in section 2.6. More on this in section 4.6.
2 In practice, more complicated random draw procedures are usually used, but this
is beside the key point. See for example Freedman et al. (1998) in this regard.
3 This illustration is a simplification of the so-called Central Limit Theorem (CLT).
For the CLT to work properly, the sample should be greater than 50 and the popu-
lation preferably much greater than 400. But the take-home message of the CLT
is that a large sample (i.e. n > 120) from a well-defined population drawn by some
random sampling mechanism will approximately resemble the population from
which it is drawn. For more on this in a non-technical fashion, see Wheelan (2013).
4 Strictly speaking, the exact number is 1.96. But, as the saying goes, 2 is close
enough for government work.
5 The formula for computing the standard error is complicated and thus not some-
thing to be bothered with when we have statistics programs at our disposal. See for
example Fox (1997), Allison (1999) or Wooldridge (2000).
6 The interval differs from the interval under the heading of “95% Conf. Interval”
due to rounding, and the fact that Stata uses 1.96 in the calculation (see note 4).
7 The alternative hypothesis in (a) suggests a two-sided significance test, whereas the
alternative hypothesis in (b) suggests a one-sided significance test. See section 4.6
for more on this.
8 Testing the alternative hypothesis is a verification or confirmation approach;
testing the null hypothesis is a falsification approach. The latter is the dominant
one taken in statistical research.
9 Assuming a large sample, which in practice is a sample of approximately 120 units
or more. If the sample size is smaller than this, t-values must be slightly larger
than 2.
10 A low t-value could be the result of a small numerator (i.e. the regression coeffi-
cient/b) or a large denominator (i.e. the standard error of b), but is in my experi-
ence most often the former.
11 Note that a low significance probability (p) for H0 is not the same as a high prob-
ability of H1 being true.
5
The assumptions of regression analysis
5.1 Introduction
Chapter 4 explained how to get a confidence interval for a population regres-
sion coefficient (i.e. an unknown parameter) from a precise sample regression
coefficient (i.e. an estimate) provided that everything was well with the regres-
sion model in question. In this sense, Chapter 4 was an idealistic enterprise.
What, then, is the worst thing that can go wrong in a regression analysis,
given that the statistics program crunched the correct numbers and gave you
an output to consider? Probably that your all-else-equal regression estimate
of interest, b1, is nowhere close to its corresponding real-life but unknown
population regression coefficient.1 How can we avoid this, i.e. avoid getting a biased (too high or too low) estimate of b1? By making sure our regression model
meets the assumptions of regression analysis. That is, provided the regression
lives up to the assumptions of regression analysis, we have sound reason to
believe that our b1 estimate is near in magnitude to its real-life equivalent.
Allison (1999) stated 20 years ago that there was lots of confusion about the
assumptions of regression analysis. In my assessment this has not improved.
For example, some assumptions are less important than others, some are vital
in certain settings but not in others and some are, on closer inspection, not
assumptions at all. The aim of this chapter is therefore to explain and illus-
trate the regression assumptions in a more intuitive fashion than what appears
as the norm in most prior textbooks on the subject. The presentation assumes
that the focal point of the analysis is to get an unbiased all-else-equal estimate
of b1 that comes as close as possible to this coefficient’s real-life but unknown
counterpart. That one also wants an unbiased all-else-equal estimate of b2 or
b3 with this property does not complicate matters –at least not in principle.
5.2 The assumptions you should think about before doing your
regression and, quite often, before you choose your data (A)
A1 All relevant independent variables are included in the regression model.
This most important assumption of regression, given the aim of finding the
unbiased all-else-equal effect of x1 on y, is untestable in strict terms. When all
relevant independent variables are included in a regression, the estimate of
b1 will not change when a new independent variable is added to the model (but
the SE of b1 might). In Chapter 3, in contrast, we saw several examples of bs
changing in size when a new independent variable was added to the regres-
sion. Technically, we refer to this as omitted variable bias. Indeed, avoiding
omitted variable bias is the major motivation for doing multiple regression
instead of only bivariate regression. The point now is that it is always conceiv-
able that some variable that we do not have access to in our data, if this was to
be included in our regression, might bring about a change in the size of b1. In
this sense the value of b1 is always uncertain to some extent. This has one key
implication: Before embarking on a regression project aiming at finding the
unbiased effect of x1 on y, you should make sure your data include all relevant
independent (control) variables of y.2 In practice, though, since this typically
is impossible, you must make do with what are hopefully the most relevant controls. And
you must be prepared to argue why this is the case.
A2 All non-relevant independent variables are excluded from the regression
model. This assumption is often mentioned in textbooks, and this is why
I bring it up. In the real life of doing regression, though, it plays a minor
role. Indeed, if a regression contains one or two non-relevant independent
(control) variables compared to five or six relevant ones, this generally has
no practical consequences for the independent variable of focal interest (cf.
Wooldridge, 2000). More importantly, many regressions aim towards finding
out whether or not a specific and perhaps policy-relevant x1 has an impact
on y. If such an analysis shows that x1 has a minor and non-significant effect
on y, this obviously does not make it “non-relevant” in the regression model.
That is, a non-finding could be an important finding in a number of cases.
. use "C:\Users\700245\Desktop\temp_myd\housedata.dta"
------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
square_m | 2587.724 217.0158 11.92 0.000 2157.007 3018.44
dist_city_c | -23198.13 3836.473 -6.05 0.000 -30812.47 -15583.8
_cons | 155917.4 34448.68 4.53 0.000 87546.26 224288.5
------------------------------------------------------------------------------
What if the scatterplot smoother shows one or more clear deviations from
a straight line? Then we must turn to the tools available for us in sections 6.3,
6.4 and 6.5.
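A scatterplot smoother of this kind may be produced in Stata along these lines (a sketch using lowess for a single independent variable; the multivariable smoother in Figure 5.1 requires a bit more work):

. twoway (scatter price_Euro square_m) (lowess price_Euro square_m)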
B3 No perfect multicollinearity. Let’s say the earnings for a sample of employees are the dependent variable. Suppose further that the research question is to find
out how years of working experience and years of age explain variation in
these earnings. Assume further, for simplicity, that everybody in the sample
started to work at the firm in question right out of college, and that no one
Figure 5.1 House sale prices by square meter and distance to city center, with
multivariable scatterplot smoother lines. 100 houses sold in a small
Norwegian city during 2017.
Graph generated in Stata.
------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
square_m | 2517.588 238.267 10.57 0.000 2044.568 2990.607
dist_city_c | -22426.32 3898.151 -5.75 0.000 -30165.12 -14687.51
renovate | 25259.94 17293.2 1.46 0.147 -9071.401 59591.28
sec_living | 1688.684 22708.81 0.07 0.941 -43394 46771.36
_cons | 151005.1 35092.74 4.30 0.000 81337.17 220673
------------------------------------------------------------------------------
. vif
this reason, the b will have a harder time reaching statistical significance.
(Remember: b/SEb = t-value.) What, then, can be done with “too high”
collinearity? Authors diverge in their recommendations. One solution is to
add more units to the regression (if possible), hoping that some of these
new observations have lower correlations than the existing ones between the
independent variables causing the problem. Another solution is to create
an index of the independent variables in question (cf. section 2.5). A third
is to drop one or more of the independent variables causing the problem,
but this may be at odds with the A1 assumption. Finally, an old tune says
“do nothing” if the regression coefficients of interest are theoretically rea-
sonable and significant despite having large standard errors. It should also
be mentioned that “high correlations” between independent variables are a concern only for the independent variables in question, and have no bearing on the regression coefficients for the other independent variables in the regression model.
In short: Although only perfect multicollinearity is a strict violation of
the regression assumption, the more common “too high” collinearity problem
leads to inflated standard errors and an identification problem for the b of
interest: We cannot trust, even when all other regression assumptions are met,
that the regression coefficient is the unbiased all-else-equal effect.
The aforementioned regression assumptions are mentioned in every text on
the subject. The next one is typically not, but see Wooldridge (2000).
B4 No influential outliers. Take a look at Figure 5.2 (former Figure 1.1).
At least three of the houses in the graph, i.e. those pointed at by arrows,
are located a bit away from the other houses. In this sense they are poten-
tial outliers. Now, think of two y = b0 + b1x1 regressions. The first is run
on all the houses in the data; the second on all houses minus the potential
outliers. If these regressions yield the same estimates of b1, you do not have
an outlier problem. In contrast, if they produce dissimilar estimates of b1,
you have a problem. That is, your outliers are influential. What to do? In
some cases, say when outliers are few compared to the total number of
units, it could be ok to delete them. In others, say when analyzing a sample
representative of some population, deleting observations might cause your
sample to become non-representative. This is generally not ok. There are
lots of diagnostic tools available for identifying outliers; see for example
Meuleman et al. (2015). In the house data above there is no influential out-
lier problem.
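One common diagnostic (a sketch, not necessarily the tools Meuleman et al. describe) is Cook’s distance, computed after the regression; observations exceeding the rule-of-thumb cutoff of 4/n are candidates for inspection:

. regress price_Euro square_m
. predict cooks, cooksd
. list if cooks > 4/_N & !missing(cooks)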
Summing up, if the assumptions in sections 5.2 and 5.3 are met, you could
be pretty sure that your b1 of main interest comes close to its unknown, real-
life counterpart. But sometimes, if not most often, your data is a random
sample from a well-defined population. In this case we need some additional
assumptions to make valid inferences from the sample b1 to its population
counterpart. This is the topic of section 5.4.
Figure 5.2 House sale prices by house sizes. 100 houses sold in a small Norwegian city
during 2017.
Figure 5.3 Earnings per year by player performance, with regression line. 242 players
in the Norwegian version of the Premier League.
Graph generated in Stata.
. estat hettest
Breusch-Pagan/Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of EarnMill
chi2(1) = 26.87
Prob > chi2 = 0.0000
Figure 5.4 Earnings per year by player performance, residuals versus predicted values
plot. 242 players in the Norwegian version of the Premier League.
Graph generated in Stata.
Stata-output 5.3 Regression of earnings (earnings_E) on player performance
(performance), with robust standard error (SE).
. use "C:\Users\700245\Desktop\temp_myd\football_earnings.dta"
------------------------------------------------------------------------------
| Robust
earnings_E | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
performance | 63176.56 12775.92 4.94 0.000 38009.31 88343.81
_cons | -173364 56597.15 -3.06 0.002 -284854.6 -61873.36
------------------------------------------------------------------------------
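The robust SE in Stata-output 5.3 is obtained by adding the robust variance option to the usual regression command; a sketch:

. regress earnings_E performance, vce(robust)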
. swilk r
Shapiro-Wilk W test for normal data
The significance probability on the right-hand side suggests that the null
hypothesis of normality of the residuals is clearly rejected (p < 0.000001).
There is no need to go into the technicalities of this test and related ones.
What is the consequence of violating the normality-of-residuals assumption? As
mentioned, in large samples, the Central Limit Theorem ensures that non-
normal residuals are unproblematic. For small samples, in contrast, non-
normality of the residuals might entail that we cannot trust the tests of
significance (i.e. p values) to be accurate.
C3 Uncorrelated residuals. This assumption is foremost a question of how
the data were generated. If your data is a random sample of people having
nothing in common other than randomly being drawn out of a bucket, you
have little to worry about. That is, this assumption means that what person
A answers in his questionnaire is uncorrelated with (i.e. independent of)
what B answers in his or hers. But let’s say ten random pupils in each class
at an elementary school answer a questionnaire, and that the y-variable is
an assessment of the pupils’ learning environment on a 1 to 10 scale. Will
the answers to this question be uncorrelated among the school’s pupils?
Probably not, since pupils in the same class share the same learning environ-
ment. Also, consider the earnings of the soccer players previously analyzed
in this chapter. Are these uncorrelated among the players? Probably not,
since the earnings of players on the same team are dependent on each
other due to a fixed salary budget. These two exceptions aside, correlated
residuals are under normal circumstances not something to worry about
when analyzing cross-sectional data. When the data are longitudinal,
such as repeated observations for the same person, firm, stock option or
country, the problem of correlated residuals is pervasive and must be dealt
with accordingly.6
What is the consequence of correlated residuals within a cross-sectional
framework? As before the problem is that the SE of b1 gets biased downwards.
That is, the statistics output will report a SE which is smaller than it really is.
For this reason, one might conclude that b1 is significant when it is not. The
solution is to calculate a SE which takes the clustering of observations (in
classes, in teams etc.) into account. This cluster-robust SE appears in Stata-
output 5.4 for the soccer players’ earnings regression (i.e. 22 727). It is much
larger (indicating less precision) than the normal SE (11 794; Stata-output
2.8) and the robust SE (12 776; Stata-output 5.3). For this reason, the level
of significance of the regression coefficient for player performance is much
higher than in previous analyses (p < 0.015).
To obtain a cluster-robust SE in Stata, you add “, cluster(variable
name)” to the usual regression command. (Here, the variable club_rank
specifies that a cluster-robust SE taking the 16 different teams in the data into
account should be calculated.)
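Put together, the command behind Stata-output 5.4 would look roughly like this:

. regress earnings_E performance, cluster(club_rank)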
Notes
1 This applies to cases where the data are random samples from a well-defined popula-
tion. In cases where they are not, which increasingly is the case in economics and the
social sciences, the aim of the regression would be to get the unbiased real-life and
all-else-equal regression coefficient without any reference to a specific population.
2 If you collect your own data, for example through a questionnaire or through
detective work on the Internet, you have in general more to say about control vari-
able completeness than when using datasets gathered by others.
3 Say we have the regression y = b1x1 + b2x2. To identify the all-else-equal effects
of x1 and x2, as expressed by b1 and b2, we need some observations for which x1
does not change when x2 changes or vice versa. In other words, in the presence of perfect multicollinearity, it is impossible to increase x1 by one unit while holding x2 constant (Berry, 1993).
4 The column “1/VIF” contains the so-called tolerance values, which some prefer to
VIFs. A VIF value of 2.5 is thus a tolerance value of 0.4.
5 We assume that these data pertain to a random sample of soccer players, although
we know for a fact they actually pertain to a population of such players.
6 The typical test for serial correlation is the Durbin-Watson test (see Wooldridge, 2000).
Exercise 5A
Use the housedata.dta.
5A1: Estimate the following regression model:
price_Euro = sec_living
What does this bivariate regression tell you regarding statistical significance?
5A2: Estimate the following regression model:
price_Euro = sec_living + renovate
What does this multiple regression tell you regarding statistical significance?
5A3: Estimate the following regression model:
price_Euro = sec_living + renovate + square_m
(i) What does this multiple regression tell you regarding statistical
significance?
(ii) What does this multiple regression tell you regarding multicollinearity,
homoscedasticity, normally distributed errors and uncorrelated
errors? (We already know, from Figure 5.1, that square_m has a
linear effect on price.)
Exercise 5B
Use the data football_earnings.dta.
5B1: Estimate the following regression model:
earnings_E = games_int
What does this bivariate regression tell you regarding statistical significance?
5B2: Estimate the following regression model:
earnings_E = games_int + age
(i) What does this multiple regression tell you regarding statistical
significance?
(ii) What does this multiple regression tell you regarding linearity,
multicollinearity, homoscedasticity and normally distributed errors?
(We already suspect, from section 5.4, that the errors probably are
correlated.)
Exercise 5B, solution
5B1:
The t-value for the regression coefficient is 6.36, and the significance prob-
ability (p) is < 0.0001. The b for games_int is thus significant by conven-
tional standards. A regression coefficient of 3 044 is, in other words, very
unlikely to appear in repeated sampling if the true regression coefficient in the
population is zero.
5B2:
(i) The t-values for the regression coefficients are 5.18 (games_int)
and 3.21 (age), and the significance probabilities are, respectively,
< 0.0001 and 0.002. Both coefficients are thus significant by conven-
tional standards. Both coefficients are so large that it is very unlikely
that they will come up in repeated sampling given true regression
coefficients of zero in the population.
(ii) Using the multivariable scatterplot smoother (Figure 5.1), we see that
none of the independent variables have a linear effect on earnings.
The remedy for this is covered in sections 6.3, 6.4 and 6.5. With no
VIF-scores being larger than 1.10, and a mean VIF-score of 1.10,
multicollinearity is of no concern. Testing for heteroscedasticity, we
firmly reject the null hypothesis of homoscedasticity (p < 0.00001).
One possible solution to this problem is covered in section 6.5. The
errors are nowhere close to being normally distributed, but this is
of no concern since the analysis is based on 242 players (i.e. the
“sample” is large).
Part 3
The extensions
6
Beyond linear regression
Non-additivity, non-linearity
and mediation
6.1 Introduction
Sometimes the complexities of life entail that the linear regression model paints too simplistic a picture of reality. If under these circumstances we hold
on to the linear model, the estimate of b1 for x1 on y could be biased. The pur-
pose of this chapter is to illustrate the ways in which we may adapt the linear
regression model to these more complicated circumstances. We first address
the violations of assumptions B1 (section 6.2) and B2 (sections 6.3 to 6.5).
Finally, section 6.6 takes up the issue of mediation which will be explained
when we get there.
Figure 6.1 Stylized relationship between earnings (y) and years of education (x1) for
men (dashed line) and women (solid line).
. use "C:\Users\700245\Desktop\temp_myd\stud_phys_17.dta"
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -1.415949 .2547573 -5.56 0.000 -1.91619 -.9157081
friend_phys | .5927305 .1633656 3.63 0.000 .271946 .9135149
_cons | 4.125651 .4679033 8.82 0.000 3.206877 5.044425
------------------------------------------------------------------------------
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .5774496 .8912158 0.65 0.517 -1.172545 2.327444
friend_phys | 1.087477 .267327 4.07 0.000 .5625532 1.612402
wom_f_p | -.786476 .3370497 -2.33 0.020 -1.448308 -.124644
_cons | 2.851303 .7181326 3.97 0.000 1.441175 4.26143
------------------------------------------------------------------------------
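The interaction variable wom_f_p in the second model is presumably the product of the two constituent variables; a sketch, assuming gender is coded 1 for women and 0 for men:

. generate wom_f_p = gender*friend_phys
. regress phys_act_h gender friend_phys wom_f_p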
Figure 6.2 Visual illustration of the interaction effect between gender and friends’ exercise levels (x-axis from 0 = lowest to 4 = highest), based on the regression model reported in Stata-output 6.2.
------------------------------------------------------------------------------
earnings_E | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 45292.13 12193.37 3.71 0.000 21271.93 69312.33
agesquared | -726.4627 222.5748 -3.26 0.001 -1164.922 -288.0039
_cons | -556343.5 163076.3 -3.41 0.001 -877594 -235093.1
------------------------------------------------------------------------------
Figure 6.4 Earnings in thousand euros by age of player, with linear (solid) and quadratic (dashed) regression lines.
non-linearity and again prefer the simpler, linear model using Occam’s razor.
In our case, the b for agesquared is significant at p = 0.001, implying a non-
linear relationship. Preferring a linear model over a non-linear one when there
is evidence of non-linearity amounts to the same kind of specification error as
holding on to an additive model when there is evidence of non-additivity (see
section 6.2). Figure 6.4 contrasts the linear regression (solid regression line)
with the quadratic regression (dashed regression line).8
When the effect of x is positive and the effect of x² (i.e. the squared term) is negative, the relationship between x and y has an inverse U-form. When the effect of x is negative and the effect of x² is positive, the relationship between x and y has a U-form. In the latter case the formula above yields a minimum point.9
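The formula in question is presumably the standard turning point of a quadratic, x* = −b1/(2 × b2). Applied to the output above:

x* = −45 292.13/(2 × (−726.4627)) ≈ 31.2,

suggesting that predicted earnings peak at around 31 years of age, consistent with the inverse U-shape in Figure 6.4.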
------------------------------------------------------------------------------
earnings_E | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age21_23 | 46350.58 22669.46 2.04 0.042 1690.229 91010.92
age24_26 | 86192.5 23134.42 3.73 0.000 40616.15 131768.8
age27_29 | 99667.94 23134.42 4.31 0.000 54091.59 145244.3
age30_32 | 102007.4 23223.17 4.39 0.000 56256.19 147758.6
age33ormore | 99896.96 24367.35 4.10 0.000 51891.65 147902.3
_cons | 40859.49 18819.6 2.17 0.031 3783.62 77935.36
------------------------------------------------------------------------------
The coefficient for age21_23 says that players aged from 21 to 23 years earn about 46 000 € more on average than players in the youngest, reference group (aged 20 or less). The latter reference group earns about 41 000 € on a
yearly basis (the constant). The remaining dummy coefficients follow the
same pattern in terms of interpretation. For example, players aged from
30 to 32 years earn 102 000 € more than the reference category on average.
The non-linearity in the earnings-age association is present in that the three
oldest age groups (27–29, 30–32 and 33 or more) have rather equal annual
earnings. That is, the earnings do not increase steadily with aging as implied
by linearity.
In summary, both the quadratic and the sets of dummy variables approach
may be used to handle non-linearity between x1 and y. The latter procedure in
general requires more observations, though. A third approach for addressing
non-linearity, coming up now, is the use of logarithms.
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inches | 526.8323 32.57528 16.17 0.000 462.33 591.3346
_cons | -14299.58 1468.823 -9.74 0.000 -17208 -11391.17
------------------------------------------------------------------------------
Figure 6.5 TV price by TV size in inches, with scatterplot smoother. 121 TVs in the
Norwegian market.
Graph generated in Stata.
------------------------------------------------------------------------------
log_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inches | .0574962 .002476 23.22 0.000 .0525934 .0623989
_cons | 6.256483 .1116436 56.04 0.000 6.035417 6.477548
------------------------------------------------------------------------------
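The logged dependent variable in the output above would be created with Stata’s ln() function before running the regression; a minimal sketch:

. generate log_price = ln(price)
. regress log_price inches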
Back to the problem. Does this solve the non-linearity in Figure 6.5?
Figure 6.6 shows the analogous scatterplot smoother for the log of TV price
and TV size. It shows a fairly linear relationship between the two variables.
Logging y in regressions like this thus often amounts to “linearizing” a non-
linear relationship between x1 and y.13 When should we use the log of y instead
Figure 6.6 TV price (logged) by TV size in inches, with scatterplot smoother. 121 TVs
in the Norwegian market.
Graph generated in Stata.
This yields 100 × (2.71828^0.494 − 1) ≈ 63.9% in our case. Smart TVs cost 64% more than non-smart TVs, controlling for TV size. (e^0.494 equals taking the antilogarithm of 0.494.)
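The percentage change can be verified with Stata’s built-in exponential function; the display command below returns roughly 63.9:

. display 100*(exp(0.494) - 1)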
Figure 6.7 Dashed line implies b1 > 0 if y is logged, and solid line implies b1 < 0 if y
is logged.
------------------------------------------------------------------------------
log_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inches | .0480655 .0021158 22.72 0.000 .0438756 .0522554
smart | .4940654 .0518822 9.52 0.000 .3913245 .5968062
_cons | 6.38027 .085303 74.80 0.000 6.211347 6.549193
------------------------------------------------------------------------------
Figure 6.8 Traditional multiple regression model where the effects of the two inde-
pendent variables are controlled for each other. The bidirectional arrow
symbolizes the fact that the two independent variables share a correlation.
Figure 6.9 Path model where x1 affects y directly (solid arrow) and indirectly through
x2 (dashed arrow and then dotted arrow).
Figure 6.10 Path model where parents’ exercise interest affects students’ hours of exer-
cise directly (solid arrow) and indirectly through students’ exercise interest
(dashed and then dotted arrow).
simplest type of path model, i.e. the model in Figure 6.9. Variable x1 is
parents’ interest in exercising (par_exe_int), x2 is students’ interest in
exercising (exercise_int) and y is students’ hours of exercise per week
(phys_act_h).16 The data are fem_stud_sport.dta. The two inde-
pendent variables are indexes ranging from 0 (minimum interest) to 6 (max-
imum interest). (See section 2.5 if you have forgotten about indexes.) The
path model is presented in Figure 6.10.
Three regressions must be carried out to capture the (a) total effect, (b) direct effect and (c) indirect effect of parents’ exercise interest in Figure 6.10 (the model specifications follow from Stata-output 6.8 below):
(1) phys_act_h = par_exe_int
(2) phys_act_h = par_exe_int + exercise_int
(3) exercise_int = par_exe_int
Regression (1) estimates the total effect, and regression (2) estimates the
direct effect. The indirect effect (c) is found by multiplying the regression
coefficient for students’ exercise interest in (2) (dotted arrow) by the regression
coefficient in (3) (dashed arrow). The results of the regressions appear in
Stata-output 6.8.
The top regression (1) suggests that the total effect is 0.465. The middle
regression (2) suggests a direct effect of 0.201. Finally, the indirect effect is
0.264 (1.327 × 0.199) based on the middle (2) and bottom (3) regression.
(Note again that total effect = direct effect + indirect effect.)
The upshot? Were we to examine the effect of parents’ exercise interest
on students’ hours of exercise by standard multiple regression, we would
suggest an effect of 0.201 (i.e. the direct effect). That is, we would under-
estimate the total effect of this variable, which is 0.465 (i.e. direct + indirect
effect).17
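In command form, the three regressions behind Stata-output 6.8 would be (a sketch, in the order (1) total effect, (2) direct effect, (3) the first leg of the indirect effect):

. regress phys_act_h par_exe_int
. regress phys_act_h par_exe_int exercise_int
. regress exercise_int par_exe_int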
. use "C:\Users\700245\Desktop\temp_myd\fem_stud_sport.dta"
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .4652037 .136924 3.40 0.001 .1957168 .7346907
_cons | 1.505246 .4920277 3.06 0.002 .5368623 2.47363
------------------------------------------------------------------------------
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .2007574 .1197773 1.68 0.095 -.0349856 .4365005
exercise_int | 1.32725 .1275867 10.40 0.000 1.076137 1.578363
_cons | -3.634039 .6488283 -5.60 0.000 -4.911048 -2.357029
------------------------------------------------------------------------------
------------------------------------------------------------------------------
exercise_int | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .1992438 .0537792 3.70 0.000 .0933983 .3050893
_cons | 3.872131 .1932521 20.04 0.000 3.491782 4.25248
------------------------------------------------------------------------------
Notes
1 In real-life academic settings (yes, there is such a thing!), several controls would
be included in this regression. I omit these here, and in the remaining chapter
7
A categorical dependent variable
Logistic (logit) regression
and related methods
7.1 Introduction
The y-variables in this book so far have been numerical, i.e. variables with
a long range of numbers as possible values. In such cases, linear regression
along with its extensions presented in Chapter 6 most often works very well,
even in the presence of mild violations of the regression assumptions. It is
thus said that linear regression is a robust method. Quite often, however,
research involves a y-variable with a few categorical values. A special case is
the binary variable; what we call a dummy. This chapter begins with analyzing
a dummy y and finishes off with the analysis of a y with three unordered
values/categories (see section 7.6).
Dummy dependent variables flourish in the social and economic sciences.
Here are some examples:
And the list goes on… In short, we are talking about dependent variables that
might be thought of as the presence (1) or no-presence (0) of something. Or,
in the framework of questionnaires, a question that directly or indirectly may
be answered with yes (1) or no (0).2
. tab fixed_pm
. sum fixed_pm
In the upper part of the output we see that about 73% of the students
prefer to pay a fixed amount per month. Conversely, 27% prefer paying per
time. The lower part tells us the same thing, but expressed as a probability or
chance. The probability (chance) that a random student prefers a fixed price is thus 0.73.
------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .0268293 .0055132 4.87 0.000 .0160037 .0376548
|
fitness_c |
yes | .1863895 .0353186 5.28 0.000 .1170388 .2557403
_cons | .5078247 .0300811 16.88 0.000 .4487583 .5668912
------------------------------------------------------------------------------
The exercise effect suggests that exercising for, say, six hours per week
entails a 2.7 percentage points larger probability of preferring a fixed fee than
exercising for five hours. Similarly, present fitness center members have an
18.6 percentage points larger probability of preferring a fixed fee than non-
members. Both effects are significant and controlled for each other.
The LPM has intuitive appeal, as it requires nothing we haven’t already
learned. But it also has some potential problems. The first is that the data
points will not lie close to the linear regression line (i.e. poor model fit).
The second is that predictions from the LPM might be smaller than 0 or
larger than 1, something which is impossible for a proportion. That is,
the probability of something occurring (like preferring fixed pay) must
always be in the 0–1 region, just as a proportion is always between 0 and
100%. The third and minor problem is that the error term, e, will always be
heteroscedastic and non-normally distributed. We’ll get back to whether
or not these issues are serious enough to abandon the LPM altogether in
section 7.4.
------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1817575 .037516 4.84 0.000 .1082274 .2552875
|
fitness_c |
yes | .9830869 .2037055 4.83 0.000 .5838314 1.382342
_cons | -.1938905 .1652578 -1.17 0.241 -.5177898 .1300088
------------------------------------------------------------------------------
Two features are of note in the output. First, both independent variables
have significant effects on the pay preference dummy; z-values correspond
roughly to t-values.3 Second, the signs of the two logistic regression coefficients
(+) tell us that students exercising a lot have a larger probability of preferring
a fixed fee than those exercising little, and that current fitness center members
have a larger probability of preferring a fixed fee than non-members. Both
findings are in keeping with the results of the LPM in Stata-output 7.2. But
other than this, the logistic coefficients unfortunately do not tell us much
about the magnitude of these effects –as opposed to the similar effects in the
LPM. Why? The answer is to be found in the upcoming mathematical proper-
ties of the logistic (logit) regression model.
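For reference, the model in Stata-outputs 7.3 and 7.4 would be estimated with something like the following (a sketch; the i. prefix treats fitness_c as a factor variable, matching the “yes” row in the output):

. logit fixed_pm phys_act_h i.fitness_c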
The statistical model actually estimated in logistic (logit) regression is

Ln(p / (1 − p)) = b0 + b1x1 + b2x2 + … + bnxn,

where p is the probability that y = 1.
The regression line is not linear but sigmoid/S-shaped, as illustrated in Figure 7.1 in the case with one independent variable, x1. The non-linear transformation thus solves the model fit problem (partly) and the prediction problem: the predictions now stay in the 0–1 region. But this rescue operation comes at a price we have seen: non-informative coefficients, i.e. coefficients not having the “if x goes up with one unit, y changes with…” interpretation. On the contrary, the figure shows that the effect of x1 – i.e. “if x1 goes up with one unit, y changes with…” – is not constant, but contingent on where on the x-axis the increase in x1 takes place.
Stata-output 7.4 Logistic regression of payment preference (fixed_pm) on weekly hours of exercise (phys_act_h) and fitness center membership (fitness_c), with marginal effects.
Output omitted …
------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1817575 .037516 4.84 0.000 .1082274 .2552875
|
fitness_c |
yes | .9830869 .2037055 4.83 0.000 .5838314 1.382342
_cons | -.1938905 .1652578 -1.17 0.241 -.5177898 .1300088
------------------------------------------------------------------------------
. margins, dydx(*)
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .031917 .0062164 5.13 0.000 .0197331 .0441009
|
fitness_c |
yes | .1772555 .0357648 4.96 0.000 .1071577 .2473533
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
The predicted probability is computed as

p = e^(b0 + b1x1 + b2x2) / (1 + e^(b0 + b1x1 + b2x2)).
This yields (see the constant and the logistic regression coefficients in Stata-outputs 7.3 or 7.4)

p = e^(−0.1939 + 0.1818 × 4 + 0.9831 × 0) / (1 + e^(−0.1939 + 0.1818 × 4 + 0.9831 × 0)) ≈ 0.63.

A student presently not a member of a fitness center who exercises for four
hours per week has a 63% probability of preferring fixed pay. What if the
“same” student is a fitness center member? By replacing 0 with 1 in the equation above (i.e. 0.9831 × 1), we get ≈ 82%. The otherwise similar student has an 82% probability of preferring a fixed pay form, i.e. a 19 percentage points larger probability. To get the predicted probability for students exercising very much, say 12 hours per week, we replace 4 with 12 in the equation, and so on.
To do the calculations above you don’t need a statistics program. But
Excel might help. In Stata these calculations may be done semi-automatically.
Stata-output 7.5 illustrates for four hours of exercise and not being a fitness
center member. (Immediately after the logistic (logit) regression you write the
command “margins, at(phys_act_h=4 fitness_c=0)”.)
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | .630215 .0277869 22.68 0.000 .5757537 .6846763
------------------------------------------------------------------------------
As we see, Stata yields the same result (i.e. 0.6302) as did our hand
calculation above.
In most practical applications of logistic regression there are several inde-
pendent variables and not only two. This makes computing of predicted
probabilities more cumbersome by hand, but the principle shown above still
applies. Also, since only one or two independent variables are of main interest
in most settings, control variables may often be set to their mean/median
values. A possible exception is dummy controls, where means often are mean-
ingless.8 Predicted probabilities for different values of a numerical or ordinal
x-variable may also be put into an illustrative graph.
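Such a graph can be produced by combining margins with marginsplot; a sketch, assuming the logit model from Stata-output 7.3 and an exercise range of 0 to 14 hours:

. quietly logit fixed_pm phys_act_h i.fitness_c
. margins, at(phys_act_h=(0(2)14))
. marginsplot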
------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1783637 .0496193 3.59 0.000 .0811116 .2756157
gender | -.2649659 .345209 -0.77 0.443 -.9415631 .4116312
womphys | .1431828 .0721017 1.99 0.047 .001866 .2844995
_cons | .0666088 .2790297 0.24 0.811 -.4802794 .6134969
------------------------------------------------------------------------------
Since the logistic regression coefficients have no intuitive metric, the main
concern is the sign and significance level of the interaction term. The fact
that it is significant (p < 0.05) tentatively suggests that the effect of hours
of exercise on preference for a fixed pay structure is different for men and
women.9 The sign of the interaction term womphys (+) indicates that the
exercise effect is stronger (0.321; 0.178 + 0.143 = 0.321) for women than for
men (0.178). A graph aids interpretation; see Figure 7.2. Note the somewhat
steeper regression line for women (black line). Whether or not this gender
difference is large enough to warrant attention is another matter, however.
Note also that the two non-linear regression lines are a direct consequence
of the inherent non-linearity in the logistic regression model. This does
not imply that these relationships are non-linear in real life. To examine
the latter, we must address non-linearity explicitly. This is the next topic
coming up.
Non-linearity. In the same vein as for linear regression models, the relationship between x1 and a dummy y may be non-linear in real life. Two of the three procedures illustrated in Chapter 6 are relevant in the logistic case: quadratic models and sets of dummy variables. (The use of logarithms is not viable, since this feature is already embedded in the logistic model.)
Figure 7.2 Visual illustration of the interaction effect between gender (male versus female) and hours of exercise per week (0 to 19), based on the logistic (logit) regression model reported in Stata-output 7.6.
The procedures are exactly the same in the linear and the logistic case, but
one should in the latter case also transform the logistic coefficients into
marginal effects and/or predicted probabilities or graphs to get them
interpretable.
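A sketch of the quadratic variant, using Stata’s factor-variable notation and the variable names from the outputs above:
. logit fixed_pm c.phys_act_h##c.phys_act_h i.fitness_c
. margins, at(phys_act_h=(0(2)12))
. marginsplot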
Model fit. Model fit is also relevant for logistic regression models, but less so than in the linear regression case. The reason is that there is no universal measure of model fit analogous to R2. Instead, there are several pseudo R2 measures available. The default in Stata is McFadden’s pseudo R2; see Stata-output 7.6. This roughly suggests that the independent variables explain 8% of the “variation” in pay form preference. Using the command “fitstat” immediately after the logistic regression model yields a suite of different pseudo R2 measures, as illustrated in Stata-output 7.7. The pseudo R2 values range from 3 to 17% in the present case.
. fitstat
| logit
-------------------------+-------------
Log-likelihood |
Model | -356.887
Intercept-only | -388.026
-------------------------+-------------
Chi-square |
Deviance(df=657) | 713.773
LR(df=3) | 62.279
p-value | 0.000
-------------------------+-------------
R2 |
McFadden | 0.080
McFadden(adjusted) | 0.070
McKelvey & Zavoina | 0.170
Cox-Snell/ML | 0.090
Cragg-Uhler/Nagelkerke | 0.130
Efron | 0.095
Tjur's D | 0.093
Count | 0.735
Count(adjusted) | 0.033
-------------------------+-------------
IC |
AIC | 721.773
AIC divided by N | 1.092
BIC(df=4) | 739.748
-------------------------+-------------
Variance of |
e | 3.290
y-star | 3.964
. tab payment_pref
As before, 73% of the students prefer the fixed payment scheme. The new
information is how those not preferring fixed pay (i.e. 27%) are divided among
paying per time exclusively or preferring a mixed payment scheme. The former is the more popular of the two, at almost 18% as opposed to just under 10%. The question
now is how hours of weekly exercise, present fitness center membership and
gender affect the choice among the three payment options. When a choice
variable has three or more alternatives or options, we prefer the multinomial
logistic regression model and not the “ordinary” logistic.10 The command and
the results are displayed in Stata-output 7.9.
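The multinomial command would be along these lines; a sketch in which the value identifying the pay per time option as the base outcome is an assumption:
. mlogit payment_pref phys_act_h i.fitness_c i.gender, baseoutcome(1)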
Interpretation of multinomial logistic models is not easy to begin with.
The trick is to note that one of the options serves as baseline in all coef-
ficient assessments. In the analysis above, this reference (i.e. base out-
come) is the exclusive pay per time option. This means that paying a fixed
fee per month (lower part of the output) or preferring a mixed payment
scheme (upper part of the output) should both be contrasted with the pay
per time option.11 Starting with the lower part of the output, we see that
students exercising a lot have a larger probability of preferring a monthly
fixed fee rather than paying per time than those exercising little (i.e. a posi-
tive and significant coefficient). Similarly, present fitness center members
have a larger probability than non-members of preferring a monthly fixed
fee rather than paying per time (i.e. a positive and significant coefficient). In
contrast, there is no gender difference in this regard since the gender coeffi-
cient is insignificant. Turning to the upper part of the output, we find that students exercising a lot appear to have a larger probability of preferring a mixed pay scheme rather than paying per time than those who exercise little (i.e. a positive and borderline significant coefficient; p = 0.056). We also
note that present fitness center members have a larger probability of pre-
ferring a mixed pay option rather than paying per time than non-members
(i.e. a positive and significant coefficient). Finally, women are less inclined
than men to prefer a mixed pay scheme as opposed to paying per time (i.e. a
negative and significant coefficient).
---------------------------------------------------------------------------------
payment_pref | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
pay_per_time | (base outcome)
----------------+----------------------------------------------------------------
mixed |
phys_act_h | .1240418 .0649641 1.91 0.056 -.0032854 .251369
|
fitness_c |
yes | .8417288 .367798 2.29 0.022 .120858 1.5626
|
gender |
female | -.8424532 .3355947 -2.51 0.012 -1.500207 -.1846996
_cons | -.7616118 .3593536 -2.12 0.034 -1.465932 -.0572918
----------------+----------------------------------------------------------------
fixed_per_month |
phys_act_h | .2446246 .0485079 5.04 0.000 .149551 .3396983
|
fitness_c |
yes | 1.314994 .2644774 4.97 0.000 .7966278 1.83336
|
gender |
female | -.1018814 .2497513 -0.41 0.683 -.591385 .3876222
_cons | .0339501 .2729352 0.12 0.901 -.5009931 .5688933
---------------------------------------------------------------------------------
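As in the binary case, the multinomial coefficients may be translated into predicted probabilities with margins; a sketch, where outcome(2) as the code for the mixed option is an assumption:
. margins, at(phys_act_h=(2 8) fitness_c=0 gender=0) predict(outcome(2))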
Further reading
Long and Freese (2014) and Train (2009) are already mentioned in this
regard. Other works on how to estimate regression models for categorical/
non-numerical y-variables include Long (1997), Hosmer et al. (2013) and Best
and Wolf (2015).
Notes
1 Binary logistic (logit) regression models are often couched in terms of volun-
tary choices between A and B. However, since the voluntariness in many of these
choices could be questioned, quotation marks are sometimes in order. Long (1997)
offers more on the various theoretical motivations for regression models on cat-
egorical y-variables.
2 Some questions are direct yes or no questions, i.e. 1 versus 0. Others may be recoded
into one, such as voting democrat (1) or republican (0) and travelling abroad during
8
An ordered (ordinal) dependent variable
Logistic (logit) regression
8.1 Introduction
Regression models for numerical, binary and unordered categorical y-variables
are now among the tools in your regression kit. This leaves the ordered or
ordinal y-variables, mentioned in the appendix to Chapter 1. An ordered vari-
able, as the name implies, means that there is an ordering among its possible
values/answer categories. A case in point is the Likert scale, where the answer
categories are (coding in parentheses) strongly disagree (1), disagree (2), nei-
ther disagree nor agree (3), agree (4) or strongly agree (5). A person agreeing
(4) is more in agreement than a person disagreeing (2). But it makes little sense to claim that s/he is twice as much in agreement, although 4 is twice as much
as 2. The reason is that the assignment of values from 1 to 5 is arbitrary. To see
this, what if 0 to 4 is used as coding instead? Then agreeing (3) is three times as
much in agreement as disagreeing (1). In the words of Long and Freese (2014,
p. 309), “…the distances between the categories are unknown” for ordered
variables. In contrast, for numerical variables, these distances are known.
Other typical ordered y-variables in the social and economic sciences are
answers to questions like:
Stata-output 8.1 Descriptive statistics for the ordinal variable exercise motive (exe_
motiv).
. use "C:\Users\700245\Desktop\temp_myd\exe_motiv_stud.dta"
. tab exe_motiv
. sum exe_motiv
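Output omitted …
The ordered logit model reported next would have been estimated with a command along these lines; a sketch, since the command itself is not shown:
. ologit exe_motiv i.gender health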
------------------------------------------------------------------------------
exe_motiv | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | -.7043771 .1678889 -4.20 0.000 -1.033433 -.375321
health | 1.20609 .1143761 10.54 0.000 .9819175 1.430263
-------------+----------------------------------------------------------------
/cut1 | .5650401 .3256599 -.0732415 1.203322
/cut2 | 1.839089 .330419 1.19148 2.486698
/cut3 | 3.197755 .3505133 2.510762 3.884749
/cut4 | 4.827386 .37901 4.08454 5.570232
------------------------------------------------------------------------------
The only things up for interpretation in the output are the signs and significance levels of the regression coefficients, as with the binary logit model in section 7.3. Both coefficients are significant. The negative sign for gender (female = 1, male = 0) implies that female students in general report lower probabilities of agreeing (4) or strongly agreeing (5) with the exercise statement than male students. That is, male students find it easier to get the motivation to exercise than female ones, in keeping with the expectation. The positive sign for
the health variable suggests that students with good or very good health are
more likely to agree or strongly agree with the exercise motive than students
reporting poorer health. Finally, the model has four constants, i.e. /cut1 and
so on, which in practice are overlooked.3
The reasons why logistic coefficients have no transparent metric were
explicated in section 7.3. We therefore turn directly to the interpretation and
communication aspect of the ordinal logit model.
(Output fragment: predicted probabilities by gender, with the control variable health held at its current value, 2.69.)
The output suggests that 9% of the female students strongly agree with the
statement, whereas the analogous figure for male students is 17% –an eight
percentage points difference. For the agree option the difference is almost ten
percentage points. This confirms what we already know, i.e. that male students
more easily find the motivation to exercise than female ones ceteris paribus.
For the neither/nor option there is gender parity, whereas female students
more often than male ones disagree with the motivation statement (21 versus
13%). Finally, female students are twice as likely to strongly disagree with the motivation statement as their male counterparts (12 versus 6%).
The gender-specific probabilities above could also be expanded to, say,
students with ok health (coded 2) and students with very good health (coded
4). Stata-output 8.4 displays the command and results.
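A minimal sketch of the kind of command behind Stata-output 8.4, using Stata’s built-in margins; the outcome(5) mapping to strongly agree is an assumption about the coding:
. margins gender, at(health=(2 4)) predict(outcome(5))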
The predicted probability of a male student (0) with ok health (2) strongly
agreeing with the exercise motivation statement is 8%; the similar probability
for a male student with very good health (4) is 50%. For female students the
corresponding probabilities are 4 and 33%. Again, we note that male students
and students with (very) good health are most likely to (strongly) agree with
the exercise motive. Female students and students with poor health dominate
at the other end of the exercise motivation scale.
The extension to more complex models involving several controls is
straightforward. What requires deliberate thinking is what to do with the
controls, i.e. what values they should be given. A typical choice for numerical
(and some ordered) variables is the mean or median, but the mean generally
makes little sense for dummies (see note 8 in Chapter 7).
Marginal effects. The benefit of using the marginal effect approach is its
ease of interpretation; it tells what happens to y in percentage points when
x increases by one unit, holding controls constant at some predefined, fixed
level. For an ordered y, however, “what happens to y” could mean sev-
eral things. Stata-output 8.5 shows the marginal effect of health status on
the exercise motive if the (control) variable gender is set to male students.
Analogously, Stata-output 8.6 shows the “same” marginal effect if gender is
set to female students.
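The book produces these outputs with the add-on command mchange (see note 4). A minimal sketch of equivalent built-in margins calls, where outcome(5) for strongly agree is again an assumption:
. margins, dydx(health) at(gender=0) predict(outcome(5))
. margins, dydx(health) at(gender=1) predict(outcome(5))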
. brant
| chi2 p>chi2 df
-------------+------------------------------
All | 7.93 0.243 6
-------------+------------------------------
1.gender | 5.56 0.135 3
health | 2.22 0.528 3
Stata-output 8.8 Descriptive statistics for the ordinal variable quality of life (qual_
life).
. tab qual_life
------------------------------------------------------------------------------
qual_life | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | -.0648673 .185848 -0.35 0.727 -.4291227 .2993881
health | 1.139609 .1253375 9.09 0.000 .8939522 1.385266
econ | .5863289 .1085067 5.40 0.000 .3736597 .7989982
-------------+----------------------------------------------------------------
/cut1 | .4619588 .4496377 -.4193148 1.343232
/cut2 | 3.240496 .4291612 2.399356 4.081637
/cut3 | 6.690987 .5135166 5.684513 7.697461
------------------------------------------------------------------------------
------------------------------------------------------------------------------
qual_life | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | -.0337418 .1885801 -0.18 0.858 -.4033521 .3358684
|
health |
poor | 1.803271 1.055825 1.71 0.088 -.2661076 3.87265
ok | 2.787032 .9830605 2.84 0.005 .8602683 4.713795
good | 3.78791 .9862848 3.84 0.000 1.854827 5.720992
very good | 5.165589 1.010412 5.11 0.000 3.185217 7.145961
|
econ |
poor | .9653349 .6141632 1.57 0.116 -.2384029 2.169073
ok | 1.344201 .5608862 2.40 0.017 .2448839 2.443517
good | 2.068292 .5727161 3.61 0.000 .9457894 3.190795
very good | 2.448862 .6456987 3.79 0.000 1.183316 3.714408
-------------+----------------------------------------------------------------
/cut1 | 1.161546 1.068282 -.9322472 3.25534
/cut2 | 3.945096 1.091726 1.805353 6.084839
/cut3 | 7.410376 1.116338 5.222395 9.598358
------------------------------------------------------------------------------
--------------------------------------------------------------------------------
qual_life | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
very_poor_poor |
gender |
female | -.7787443 .5213847 -1.49 0.135 -1.80064 .243151
econ | -.9997106 .2868687 -3.48 0.000 -1.561963 -.4374583
health | -1.246154 .3539768 -3.52 0.000 -1.939936 -.5523724
_cons | 2.654674 1.028803 2.58 0.010 .6382568 4.671091
---------------+----------------------------------------------------------------
ok |
gender |
female | -.2395571 .2325986 -1.03 0.303 -.695442 .2163277
econ | -.4196126 .1346619 -3.12 0.002 -.6835452 -.15568
health | -.9717151 .1579639 -6.15 0.000 -1.281319 -.6621115
_cons | 2.686474 .5378792 4.99 0.000 1.63225 3.740698
---------------+----------------------------------------------------------------
good | (base outcome)
---------------+----------------------------------------------------------------
very_good |
gender |
female | -.6311667 .2849884 -2.21 0.027 -1.189734 -.0725998
econ | .5419019 .1826358 2.97 0.003 .1839422 .8998616
health | .9590341 .2174268 4.41 0.000 .5328854 1.385183
_cons | -5.46392 .8824489 -6.19 0.000 -7.193488 -3.734352
--------------------------------------------------------------------------------
For the ordinal model in Stata-output 8.9 there was no gender difference in
terms of quality of life assessments among the students. This is also the case
for reporting good quality of life (i.e. base outcome) as opposed to ok or
very poor/poor quality of life. That is, the two top gender coefficients are both
insignificant. In contrast, female students have a significantly lower probability
(p = 0.027) of reporting very good quality of life as opposed to good quality
of life than male students. In this sense, using the default ordinal regression
model just because the dependent variable “happens to be ordinal” would
mask a (possibly) important gender difference (see note 7).
. use "C:\Users\700245\Desktop\temp_myd\stud_tourism.dta"
. tab prob_trav
. sum prob_trav
Further reading
Long (1997, 2015), Long and Freese (2014), Fullerton (2009) and Greene and
Hensher (2010) are all noteworthy introductions to, and explications of, the
ordinal regression model.
Notes
1 The ordinal (logit) regression model is often traced back to McKelvey and Zavoina
(1975). In biostatistics the model is called the proportional odds model.
2 If linear regression was used to examine this research question, this would involve
scrutinizing how the gender dummy affected the mean of the five-point Likert scale.
More on this in section 8.6.
3 The technical background for the four constants or cut points is that the ordinal
logit model simultaneously estimates four binary logit models in the present
case: strongly disagree (1) versus all higher values (2 to 5), strongly disagree and
disagree (1 and 2) versus all higher values (3 to 5), strongly disagree and disagree
and neither/nor (1 to 3) versus all higher values (4 and 5) and strongly disagree and
disagree and neither/nor and agree (1 to 4) versus all higher values (5).
4 The “mtable” and upcoming “mchange” command (see Stata-outputs 8.5/8.6)
are among the suite of commands in the add-on “spost13_ado”, available on
the Internet (write “findit spost13_ado” in the command window of Stata,
press Enter and follow the instructions). Long and Freese (2014) developed these
commands.
5 This test is among the suite of commands in the spost13_ado package (see
note 4).
6 Long and Freese (2014) caution against putting too much weight on this test and
related ones.
7 Assuming no ordering among the alternatives of y is always safer than assuming
an order. That is, although a y-variable is ordinal by definition, this does not by
necessity imply that ordinal regression is the most appropriate estimation procedure
(see Long and Freese, 2014).
8 “Lots of people” may of course mean lots of observations/units. Linear regression
will typically also work fine for a five- or six-point scale provided that the distribu-
tion on y is not too skewed. This modeling choice may be justified in the same way
as linear regression (i.e. the LPM) may be justified on a binary y (see section 7.2).
Exercise 8A
Use the data stud_tourism.dta. An ordinal variable in these data called
“safe” is the answer to the question “To what extent do you consider it safe
to travel to a European big city?” (1 = very unsafe; 7 = very safe). A dummy variable, gender, is coded 1 for females and 0 for males. The third variable of
9
The quest for a causal effect
Instrumental variable (IV) regression
9.1 Introduction
So far, we have seen several times that the estimate of b1 for x1 on y changes
in size when a new independent variable, x2, is added to the regression. Given
that much quantitative research is about trying to pinpoint exactly how x1 – as
measured by the size of b1 – affects y holding other relevant control variables
constant, we’re up for a big challenge. How come? As mentioned in Chapter 3,
data limitations invariably put us in a position lacking access to some of these
relevant controls, making our estimate of b1 uncertain. In real life, b1 might
be smaller or it might be larger; we simply do not know. This chapter is about
trying to get the correct, as in unbiased, estimate of b1 for x1 on y when we
have reason to believe that our regression model lacks one or more relevant
control variables.
Let’s assume we have a regression model in which b1 for x1 on y is 5, and
that this regression includes all relevant control variables and is otherwise cor-
rectly specified. This implies that no matter how many other controls we add
to the regression, the size of b1 (i.e. 5) will not change: The estimate of b1 is
unbiased. If so, we may tentatively refer to this regression coefficient (or effect)
as the causal effect.1 Finding the causal effect of x1 on y is mostly discussed
in the realm of doing experiments that involve manipulating the x-variable of
interest; more on this in Chapter 10. In this chapter we aim to pinpoint causal
effects for observational data where we as analysts are passive collectors of
data. The procedure is known as instrumental variable (IV) regression.
IV regression is both simple and hard at the same time. Basically, and
compared with regression in the usual linear manner, we just need one add-
itional independent variable, w. This w is called an instrument or instrument
variable and should meet two requirements: (1) It should be correlated with
x1. (2) It should not affect y, other than indirectly through x1.2 (More on this
in section 9.3.) The hard part is finding instrument variables that actually
meet both requirements.
The rest of Chapter 9 is organized as follows: Section 9.2 introduces
an example of IV regression using data on students. The variable relation-
ship of interest is between exercise behavior and health. Section 9.3 extends
. use "C:\Users\700245\Desktop\temp_myd\iv_phys_health.dta"
------------------------------------------------------------------------------
health | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .0869533 .0089926 9.67 0.000 .0692963 .1046103
gender | .1202736 .0602493 2.00 0.046 .001974 .2385731
youth_sport | .0300321 .0098181 3.06 0.002 .0107543 .0493099
_cons | 2.110347 .0933509 22.61 0.000 1.927052 2.293641
------------------------------------------------------------------------------
The output says that exercise has a positive and significant effect on health, as to be expected. The b1 of ≈ 0.087 dynamically implies that, say, six more hours of exercise per week entails a 0.52-point better health score on the 0 to 4 scale (0.087 × 6 = 0.522). The output also suggests that females have slightly better health than males (gender is coded 1 for female and 0 for male). The variable
------------------------------------------------------------------------------
health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .178763 .057619 3.10 0.002 .0658319 .2916941
gender | .2332998 .0951796 2.45 0.014 .0467513 .4198483
youth_sport | .0115949 .0155187 0.75 0.455 -.0188213 .0420111
_cons | 1.728247 .2567162 6.73 0.000 1.225093 2.231402
------------------------------------------------------------------------------
Instrumented: phys_act_h
Instruments: gender youth_sport sport_club
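The IV command itself is not reproduced in the output; a minimal sketch consistent with the syntax described in note 5 is:
. ivregress 2sls health gender youth_sport (phys_act_h = sport_club)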
The IV regression finds the b1 for exercise to be ≈ 0.179, i.e. about twice as
large as the linear regression b1. (We pay no attention to the effects of other
variables in IV regression; this method is all about obtaining the unbiased
estimate of b1 for the independent variable of key interest.) Note also that the
standard error (SE) of the IV coefficient is much larger than the analogous
SE in the linear regression case; this is a general feature of IV regression (see
Wooldridge, 2000).
Taking stock, the IV regression finds the effect of exercise on health to be
much larger than what the “naïve” linear regression suggests. Is this a tenable
result? We’ll return to this question later; now we take a closer look at the
assumptions that IV regression is built on.
Stata-output 9.3 Test of instrument relevance for the instrument variable sport club
membership.
. estat firststage
------------------------------------------------------------------------------
health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .122888 .0209191 5.87 0.000 .0818874 .1638886
gender | .1645125 .0650596 2.53 0.011 .0369981 .2920269
youth_sport | .0228157 .0106033 2.15 0.031 .0020336 .0435978
_cons | 1.960791 .1225684 16.00 0.000 1.720562 2.201021
------------------------------------------------------------------------------
Instrumented: phys_act_h
Instruments: gender youth_sport sport_club fitness_c
. estat firststage
. estat overid
------------------------------------------------------------------------------
health_good | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1444594 .0199195 7.25 0.000 .1054179 .1835009
gender | .2226788 .113964 1.95 0.051 -.0006865 .4460441
youth_sport | .0443555 .0181226 2.45 0.014 .0088359 .0798751
_cons | -.6309021 .1766503 -3.57 0.000 -.9771304 -.2846739
------------------------------------------------------------------------------
In these settings, i.e. when all relevant control variables are most probably not available in the datasets up for analysis, IV regression may be a key tool
in the quantitative social scientist’s toolbox. That said, IV regression is by no
means a quick-fix, as the next paragraph emphasizes.
A cautionary tale. IV regression is neither a mechanistic solution nor a magic
bullet in terms of getting an unbiased b1 for x1 on y using observational data.
On the contrary, the procedure rests on strong assumptions that in most
settings are not empirically testable but rely on sound, theoretical judgement.
A related issue concerns whether or not the IV regression confirms the basic
story told by, for example, a linear regression. If it does, or, as in our case above, amplifies it, we generally have stronger support for our initial hypothesis. In contrast, when the IV story radically diverges from the linear regres-
sion story, we tend to have a problem…
Further reading
Despite the fact that there are specific IV regression introductions for a number of the sub-disciplines in the social sciences (see below), I recommend Stock and Watson (2007) to every prospective social scientist as the introduction to
IV regression. A more recent introduction is Muller et al. (2015). Other than
this, we have:
Notes
1 Causal effects are typically discussed with regard to medical experiments (trials).
Here, one group of subjects gets treatment (e.g. a pill containing real medicine),
whereas the other group gets no treatment (a placebo pill containing no real medi-
cine). Furthermore, the assignment to treatment or placebo is based on a random
procedure; e.g. flip of a coin. This treatment/placebo variable might be thought of as
a dummy, x1. Two months later, both groups undergo a diagnostic medical test, i.e.
the dependent variable, y. The b1 of this x1-y regression is the average causal effect
or treatment effect. More complicated designs involve different dosages (placebo,
light, moderate, heavy, very strong), but the same logic holds. I have more to say on
regression analysis for experimental data in the social sciences in section 10.2.
2 Rosenbaum (2017) argues, in my view correctly, that an instrument is the proper
term, but since an instrument variable seems to be the term typically used in the
literature, I’ll stick with the latter.
3 Another reason why the exercise effect on health probably is not causal is that we
cannot in a cross-sectional context rule out the possibility that health affects exer-
cising; i.e. that people who are healthy for, say, primarily genetic reasons, are more
able to exercise more often. As it turns out, without going into details, IV regression
also tackles this simultaneity problem. See also section 11.4 on this problem.
4 The exercise variable, x1, is called endogenous in econometrics lingo, which means
correlated with y’s (here: health’s) error term, e. In layman speak we would call it
“problematic” in the sense that the estimate of b1 probably is biased. Or, at least,
that this cannot be ruled out.
5 The “2sls” in the command is short for two-stage least squares estimator; the instru-
ment variable sport_club is written after the “=” in the parenthesis at the end of
the command.
6 If the F-statistic is < 10, the instrument is said to be “weak”. This, in general, makes
the IV estimate of b1 less credible. See Murray (2006) for more on weak instruments.
7 The logic behind this test is spelled out in Stock and Watson (2007, p. 443–445).
It must be emphasized, however, that a non-significant test result by no means
confirms that the instrument variables are exogenous.
8 This, at first glance, seems to be the rule rather than the exception in the literature, emphasizing how the quality of the instrument variables is a fundamental requisite for the trustworthiness of IV results.
9 To create this new dummy variable, i.e. health_good, I first used the commands:
gen health_good = health
recode health_good 0/2=0 3/4=1
Part 4
10
Regression purposes in various
academic settings and how
to perform them
10.1 Introduction
Up until now the focus of this book has been on how regression works, how
to do it on a computer and how to interpret the results that statistics programs
like Stata, R or SPSS produce. This chapter brings together this material and
puts it into the realm of doing and reporting regression analysis in various
academic settings. Section 10.2 presents four different ways of doing regres-
sion analysis according to its main purpose or motivation. In practice, though,
the boundaries between these ways of doing regression are (unfortunately)
often blurred in current research. Section 10.3 presents the typical structure
or setup of a regression-based study, whereas section 10.4 closes with some
handy advice on how to proceed with a regression-based academic project.
Note. Standard errors are in parentheses. Model B also controls for department (six dummies).
a The question reads “To what extent do you think Norway has had a successful gender equality policy for academia so far?”, with answer categories ranging from “to a very little extent” (1) to “to a very large extent” (10).
b Left-wing political standpoint is the reference.
* p < 0.05; ** p < 0.01; *** p < 0.001 (two-sided tests).
sample. For this reason, and since it makes little sense to think of the univer-
sity college in question as representative of all such colleges in Norway, the
inferential statistics covered in Chapter 4 and the C assumptions mentioned
in section 5.4 are not necessarily very relevant. In any event, the focus here is
strictly on what descriptively happens in the data at hand.
Model A shows that the mean of the more-gender-equality-effort scale is
6.31 for men (the constant) and 8.55 for women (6.31 + 2.24 = 8.55), or that
women on average score 2.24 points higher than men on this scale from 1
to 10. In other words, female faculty members are more in favor of gender-
equality enhancing policies than male faculty members.3 The gender variable
accounts for 22% of the variation (variance) in this equality-effort-scale; cf.
the R2.
Model A reports the unconditional gender difference in the effort scale.
Model B, in contrast, reports the conditional gender difference, which is 1.80.
This suggests that the controls “explain away” about 20% of the uncondi-
tional gender difference (2.24). However, the (unexplained) gender difference
(1.80) is still substantial. As for the controls, we see that those who view the
former policy on the subject as successful tend to see less need for gender-
equality enhancing policies, i.e. they have lower scores on the effort scale.
Furthermore, those with doctorates tend to see less need for gender-equality
enhancing policies than those without. By the same logic, left-wingers are more
Panel A
                      (1)      (2)      (3)      (4)      (5)      (6)      (7)
WOM:a Movie very      1.13**   1.04**   1.07***  0.84*    1.15***  0.56     1.26***
worthy of seeing     (0.353)  (0.343)  (0.327)  (0.359)  (0.341)  (0.354)  (0.370)
Constant              5.49     3.36     3.81     4.31     5.71     3.92     2.64
Controls:             No       No       No       No       No       No       No
R2                    0.18     0.12     0.10     0.09     0.13     0.05     0.12

Panel B
                      (1)      (2)      (3)      (4)      (5)      (6)      (7)
WOM:a Movie very      1.22***  0.99**   1.04***  0.893*   1.16***  0.51     1.21***
worthy of seeing     (0.348)  (0.342)  (0.317)  (0.358)  (0.339)  (0.356)  (0.371)
Constant              5.63     3.26     5.77     4.36     7.37     4.29     2.64
Controls:             Yes      Yes      Yes      Yes      Yes      Yes      Yes
R2                    0.23     0.15     0.18     0.13     0.17     0.06     0.14
Note. Standard errors are in parentheses. Control variables are gender, age, number of cinema
visits during the previous two months and number of movie-streaming subscriptions.
a = Reference: Movie somewhat worthy of seeing.
1 = Genre: Action and adventure.
2 = Genre: Serious drama.
3 = Genre: Light drama/comedy drama.
4 = Genre: Science-fiction.
5 = Genre: Comedy.
6 = Genre: Norwegian movie.
7 = Genre: Documentary.
* p < 0.05; ** p < 0.01; *** p < 0.001 (two-sided tests).
The regression coefficients for WOM are in the main unaffected by adding of
controls, exemplifying how the experimental design makes control variables
more or less redundant.16
The analyses above tacitly assume that the causal effects are similar for all
subgroups in the data, such as males and females. To test if this is plausible or
not, one might add interaction variables to the regression models in the usual
manner (see section 6.2). This is often referred to as examining effect hetero-
geneity in the lingo of regression analysis for experimental data.
Regression as causal analysis: Some final thoughts. Finding a causal effect is
the most ambitious social science research endeavor. Whether or not plain
vanilla regression analysis might be used for this purpose depends primarily
on how the data up for analysis came about, i.e. the data generating process.
From a purist stance one must have some sort of manipulation of x1 if we are
even to start talking about causal effects. That is, for some, causal analysis
is for experimental data only. But under such circumstances, as illustrated
above, regression fits the bill perfectly as a statistical tool of choice. This
brings the old adage “correlation is not causation” to the fore. When people make this claim, they tend to mean that regression cannot be used to prove causation. The analyses above, as well as regression on experimental data in general, attest to the contrary. The key point in this respect is that it is first and foremost the data generating process, and not the statistical tool per se, that allows or disallows causal inference.

The typical structure of a regression-based study (section 10.3):
Abstract
1. Introduction
1.1 The study purpose
1.2 Theory and prior research
1.3 Precise research goal/hypothesis
2. Data and methods
2.1 Presentation of data
2.2 Presentation of main variables
2.3 Presentation of regression results
3. Conclusions and implications
3.1 Study purpose revisited
3.2 Main findings and conclusions
3.3 Implications for future research and policy
Despite experimental designs having become more in vogue in the social
sciences of late, they are nevertheless not viable in many situations, for eth-
ical, practical, political or economic reasons.17 This leaves the quantitative
social scientist with “only” the observational data of this book. Should causal
aspirations be abandoned here? Taking the purist stance on causal analysis,
the answer is certainly yes. A more pragmatic, and tentatively more positive, stance lies in the use of IV regression or in making sure the
linear regression model includes all relevant control variables. Other empir-
ical strategies for obtaining causal effects for observational data are discussed
briefly in sections 11.4 and 11.6.
Notes
1 When y is binary or ordered, which also makes for prediction, some equivalent
pseudo R2 measure or proportion of correct classifications is typically used.
2 The remaining part of section 10.2 draws heavily on Berk (2004).
3 Without some benchmark it is difficult to say what constitutes a large difference
between two means. Cohen’s d (Cohen, 1988) is typically used as an effect size
measure in this regard. The Cohen’s d for this gender difference is 1.08, which
typically is thought of as a “large” effect in the social sciences (d > 0.80 = large
difference).
4 Provided there is no need for squared terms/sets of dummies to capture non-
linearities or product terms to capture interaction effects. The use of logarithms is
not relevant here since the y-variable has rather few values.
5 If not stated explicitly otherwise, Stata and other statistics programs always
assume by default that the data in question are a random sample from some larger
population. For this reason, they always calculate inferential statistics.
6 The A1 assumption plays pretty much the same role in descriptive studies as in
inference beyond the data studies.
7 The alternative justification for significance testing is so-called model-based signifi-
cance testing; see Hoem (2008) or Lumley (2010) for more on this.
8 There is a flavor of Sherlock Holmes logic in play here: If all other causes
(suspects) have been ruled out by logic/evidence, the remaining cause (suspect)
must be “guilty”, however improbable that sounds. See also Pearl and Mackenzie
(2018) or Rosenbaum (2017).
9 Of course, since settings are never ideal, there might be tiny correlations (positive
or negative) between x1 and the controls. This is typically why b1 might change
somewhat in size when controls are added to the model. Remember: A relevant
control variable affects y and correlates with x1. Yet since the RCT by its design
ensures that the latter correlation is close to zero if all goes well, the A1 assumption
becomes superfluous.
10 More complicated designs involve three treatments or one placebo and two
treatments.
11 References to this literature are found in Thrane (2018).
12 Note also that the units analyzed in this research are movies, and not people
making actual movie-seeing decisions.
11
The way ahead
Related techniques
11.1 Introduction
The time has come to look forward towards other, related statistical techniques
that might be handy in a regression analyst’s toolbox. These techniques, it
should be emphasized, are often tailor-made and more statistically complex
solutions to the idiosyncrasies in the data up for analysis. Yet they are gen-
erally straightforward to implement given a firm grip on the plainer vanilla
regressions offered in this book so far.
------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
square_m | 2223.958 315.2902 7.05 0.000 1598.275 2849.642
_cons | 134130.2 45949.42 2.92 0.004 42945.08 225315.3
------------------------------------------------------------------------------
Stata-output 11.2 Quantile (q25, q50, q75) regression of house sale prices (price_
Euro) on house sizes (square_m).
------------------------------------------------------------------------------
| Bootstrap
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
q25 |
square_m | 2661.458 156.0873 17.05 0.000 2351.708 2971.209
_cons | 4374.998 23900.16 0.18 0.855 -43054.1 51804.09
-------------+----------------------------------------------------------------
q50 |
square_m | 2223.958 351.2313 6.33 0.000 1526.951 2920.966
_cons | 134130.2 60473.29 2.22 0.029 14122.91 254137.5
-------------+----------------------------------------------------------------
q75 |
square_m | 2619.048 346.9013 7.55 0.000 1930.633 3307.462
_cons | 132619 49833.23 2.66 0.009 33726.61 231511.4
------------------------------------------------------------------------------
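The commands behind Stata-outputs 11.1 and 11.2 are not reproduced above; a minimal sketch, where sqreg’s bootstrapped standard errors match the output header and the number of bootstrap replications is left at its default:
. regress price_Euro square_m
. sqreg price_Euro square_m, quantiles(.25 .5 .75)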
. use "C:\Users\700245\Desktop\temp_myd\football_earnings.dta"
Output omitted …
------------------------------------------------------------------------------
goals_season | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
games_season | .0653718 .0109971 5.94 0.000 .0438178 .0869257
midfield | .9295584 .1661253 5.60 0.000 .6039587 1.255158
striker | 1.827004 .1734234 10.53 0.000 1.487101 2.166908
_cons | -1.632941 .2789111 -5.85 0.000 -2.179596 -1.086285
-------------+----------------------------------------------------------------
/lnalpha | -.7654989 .2286809 -1.213705 -.3172925
-------------+----------------------------------------------------------------
alpha | .4651018 .1063599 .2970944 .7281177
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 63.18 Prob >= chibar2 = 0.000
Since the NB coefficients have little intuitive appeal on their own, the things to notice in the output are their signs and significance levels. Note that
players with more appearances on the pitch during the season score more
goals, and that midfielders and offensive players score more goals than
defenders. All three results are much as expected, and all three coefficients
are significant.1
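The NB model in Stata-output 11.3 would have been estimated with a command along these lines (a sketch):
. nbreg goals_season games_season midfield striker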
. mchange
| Change p-value
-------------+----------------------
games season |
+1 | 0.154 0.000
+SD | 1.526 0.000
Marginal | 0.149 0.000
midfield |
+1 | 3.490 0.001
+SD | 1.261 0.000
Marginal | 2.115 0.000
striker |
+1 | 11.869 0.000
+SD | 2.420 0.000
Marginal | 4.158 0.000
Interpretation: One more match played during the season increases the
number of goals scored by 0.15 on average, all-else-equal and interpreted
dynamically. On average, midfielders score 2.12 more goals than defenders
all-else-equal, whereas the similar difference between defenders and offensive
players is 4.16 goals.
Excess zeros. One sometimes encounters that the y of interest contains an
“excess” number of zeros, meaning that the zero count is much more fre-
quent (relatively) than positive counts. A case in point is number of cigarettes
smoked in a day. Here, all non-smokers, typically 80% or more in a “normal”
sample of adults, report zero. (This was partly also the case for the goal-
scoring model estimated above; 40% of the players scored zero goals during
the season.)
To accommodate excess zeros, we use the zero-inflated count data regression
models. In these models the independent variables affect two processes: (1)
y = 0 versus y > 0 and (2) amount of y, given y > 0. I use the data football_
earnings.dta to illustrate again. The y is total number of international
matches played in the career (games_int; 60% have zero international
matches), and the independent variables are age and performance. Stata-
output 11.5 displays the results.
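A sketch of the corresponding command; letting the same two variables enter the inflation equation is read off the output:
. zinb games_int age performance, inflate(age performance)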
Output omitted …
------------------------------------------------------------------------------
games_int | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
games_int |
age | .0681041 .0311644 2.19 0.029 .007023 .1291852
performance | .410101 .2730644 1.50 0.133 -.1250954 .9452975
_cons | -1.549452 1.661591 -0.93 0.351 -4.806111 1.707207
-------------+----------------------------------------------------------------
inflate |
age | -.2276064 .0454371 -5.01 0.000 -.3166615 -.1385512
performance | -2.223691 .4688857 -4.74 0.000 -3.14269 -1.304692
_cons | 16.52414 2.744172 6.02 0.000 11.14567 21.90262
-------------+----------------------------------------------------------------
/lnalpha | .3715287 .2249229 1.65 0.099 -.0693122 .8123696
-------------+----------------------------------------------------------------
alpha | 1.449949 .3261269 .9330353 2.253241
------------------------------------------------------------------------------
The top coefficients tell us that age has a significant and positive effect
on number of international matches played, given that a player has played
at least one such match. In contrast, the performance variable’s similar
effect is insignificant. The bottom coefficients say that both independent
variables have negative and significant effects on not having played an
international match –or positive effects on having played one or more such
matches. That is, both independent variables affect whether a player is “an
international” or not. More refined interpretations of effects may also
be done for the zero-inflated count data regression model; see Long and
Freese (2014). The standard reference for count data regression models is
Cameron and Trivedi (2013).2
Notes
1 The significant Alpha-test at the bottom of the output suggests that the NB model
is the better choice than the Poisson model. More technically, there is evidence of
so-called overdispersion (cf. Long and Freese, 2014).
References
Agresti, A. and Finlay, B. (2014). Statistical Methods for the Social Sciences. Fourth
Edition. Harlow, England: Pearson Education Limited.
Ai, C. and Norton, E. C. (2003). Interaction terms in logit and probit models.
Economics Letters, 80, 123–129.
Allison, P. D. (1999). Multiple Regression. A Primer. Thousand Oaks, CA: Pine
Forge Press.
Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics. An Empiricist’s
Guide. Princeton, NJ: Princeton University Press.
Baron, R. M. and Kenny, D. A. (1986). The moderator-mediator variable distinction
in social psychological research: Conceptual, strategic, and statistical consider-
ations. Journal of Personality and Social Psychology, 51, 1173–1182.
Berk, R. A. (2004). Regression Analysis. A Constructive Critique. Thousand Oaks,
CA: Sage Publications, Inc.
Berry, W. D. (1993). Understanding Regression Assumptions. Newbury Park, CA: Sage
Publications, Inc.
Best, H. and Wolf, C. (2015). Logistic regression. In Best, H. and C. Wolf: The Sage
Handbook of Regression Analysis and Causal Inference, pp. 153–171. London: Sage
Publications, Inc.
Blossfeld, H.-P. and Blossfeld, G. J. (2015). Event history analysis. In Best, H. and C.
Wolf: The Sage Handbook of Regression Analysis and Causal Inference, pp. 359–
386. London: Sage Publications, Inc.
Bollen, K. A. (2012). Instrumental variables in sociology and the social sciences.
Annual Review of Sociology, 38, 37–72.
Brechot, M., Nüesch, S. and Franck, E. (2017). Does sports activity improve health?
Representative evidence using local density of sports facilities as an instrument.
Applied Economics, 49, 4871–4884.
Breen, R., Karlson, K. B. and Holm, A. (2013). Total, direct, and indirect effects in
logit and probit models. Sociological Methods & Research, 42, 164–191.
Cameron, A. C. and Trivedi, P. K. (2013). Regression Analysis of Count Data. Second
Edition. New York: Cambridge University Press.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences.
London: Routledge.
Deaton, A. and Cartwright, N. (2018). Understanding and misunderstanding
randomized controlled trials. Social Science & Medicine, 210, 2–21.
DeMaris, A. (2004). Regression with Social Data. Modeling Continuous and Limited
Response Variables. Hoboken, NJ: John Wiley & Sons.
Dougherty, C. (2011). Introduction to Econometrics. Fourth Edition. Oxford: Oxford
University Press.
Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods.
Thousand Oaks, CA: Sage Publications, Inc.
Freedman, D., Pisani, R. and Purves, R. (1998). Statistics. Third Edition.
New York: W.W. Norton & Company.
Fullerton, A. S. (2009). A conceptual framework for ordered logistic regression models.
Sociological Methods & Research, 38, 306–347.
Gangl, M. (2015). Matching estimators for treatment effects. In Best, H. and C.
Wolf: The Sage Handbook of Regression Analysis and Causal Inference, pp. 251–
276. London: Sage Publications, Inc.
Greene, W. H. and Hensher, D. A. (2010). Modeling Ordered Choices. A Primer.
New York: Cambridge University Press.
Gujarati, D. (2015). Econometrics by Example. Second Edition. London: Palgrave
Macmillan.
Halvorsen, R. and Palmquist, R. (1980). The interpretation of dummy variables in
semilogarithmic equations. American Economic Review, 70, 474–475.
Hernán, M. A. and Robins, J. M. (2006). Instruments for causal inference. An
epidemiologist’s dream? Epidemiology, 17, 360–372.
Hoem, J. M. (2008). The reporting of statistical significance in scientific journals.
Demographic Research, 18, 437–440.
Hosmer, D. W., Lemeshow, S. and Sturdivant, R. X. (2013). Applied Logistic Regression.
Third Edition. Hoboken, NJ: John Wiley & Sons.
Hox, J. and Wijngaards-de Meij, L. (2015). The multilevel regression model. In Best,
H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal Inference,
pp. 11–131. London: Sage Publications, Inc.
Imbens, G. W. (2014). Instrumental variables: An econometrician’s perspective.
Statistical Science, 29, 323–358.
Kennedy, P. (2002). Sinning in the basement: What are the rules? The ten commandments
of applied econometrics. Journal of Economic Surveys, 16, 569–589.
King, G. (2006). Publication, publication. Political Science and Politics, 39, 119–125.
King, G. and Roberts, M. E. (2015). How robust standard errors expose methodological
problems they do not fix, and what to do about it. Political Analysis, 23, 159–179.
Koenker, R. (2005). Quantile Regression. New York: Cambridge University Press.
Lee, D. S. and Lemieux, T. (2015). Regression discontinuity designs in social sciences.
In Best, H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal
Inference, pp. 301–326. London: Sage Publications, Inc.
Lohmann, H. (2015). Non-linear and non-additive effects in linear regression. In Best,
H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal Inference,
pp. 111–131. London: Sage Publications, Inc.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables.
Thousand Oaks, CA: Sage Publications, Inc.
Long, J. S. (2015). Regression models for nominal and ordinal outcomes. In Best, H.
and C. Wolf: The Sage Handbook of Regression Analysis and Causal Inference, pp.
173–203. London: Sage Publications, Inc.
Long, J. S. and Freese, J. (2014). Regression Models for Categorical Dependent Variables
Using Stata. Third Edition. College Station, TX: Stata Press.
Lumley, T. (2010). Complex Surveys. A Guide to Analysis Using R. Hoboken, NJ: John
Wiley & Sons.
Martens, E. P., Pestman, W. R., de Boer, A., Belitser, S. V. and Klungel, O. H. (2006).
Instrumental variables. Applications and limitations. Epidemiology, 17, 260–267.
McKelvey, R. D. and Zavoina, W. (1975). A statistical model for the analysis of ordinal
level dependent variables. Journal of Mathematical Sociology, 4, 103–120.
Menard, S. (1995). Applied Logistic Regression Analysis. Thousand Oaks, CA: Sage
Publications, Inc.
Meuleman, B., Loosveldt, G. and Emonds, V. (2015). Regression analysis: Assumptions
and diagnostics. In Best, H. and C. Wolf: The Sage Handbook of Regression Analysis
and Causal Inference, pp. 83–110. London: Sage Publications, Inc.
Mitchell, M. N. (2012). Interpreting and Visualizing Regression Models Using Stata.
College Station, TX: Stata Press.
Morgan, S. L. and Winship, C. (2015). Counterfactuals and Causal Inference. Second
Edition. New York: Cambridge University Press.
Muller, C., Winship, C. and Morgan, S. L. (2015). Instrumental variables regression.
In Best, H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal
Inference, pp. 277–299. London: Sage Publications, Inc.
Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments.
Journal of Economic Perspectives, 20, 111–132.
Nagler, J. (1994). Scobit: An alternative estimator to logit and probit. American Journal
of Political Science, 38, 230–255.
Pearl, J. and Mackenzie, D. (2018). The Book of Why. The New Science of Cause and
Effect. New York: Basic Books.
Pedhazur, E. J. (1997). Multiple Regression in Behavioral Research. Explanation and
Prediction. Third Edition. Orlando, FL: Harcourt Brace College Publishers.
Pieters, R. (2017). Meaningful mediation analysis: Plausible causal inference and
informative communication. Journal of Consumer Research, 44, 692–716.
Rosenbaum, P. R. (2017). Observation & Experiment. An Introduction to Causal
Inference. Cambridge, MA: Harvard University Press.
Sovey, A. J. and Green, D. P. (2010). Instrumental variables estimation in political
science: A readers’ guide. American Journal of Political Science, 55, 188–200.
Sterba, S. K. (2009). Alternative model-based and design-based frameworks for infer-
ence from samples to populations: From polarization to integration. Multivariate
Behavioral Research, 44, 711–740.
Stigler, S. M. (2016). The Seven Pillars of Statistical Wisdom. Cambridge, MA: Harvard
University Press.
Stock, J. H. and Watson, M. W. (2007). Introduction to Econometrics. Second Edition.
Boston, MA: Pearson Education.
Studenmund, A. H. (2017). Using Econometrics. A Practical Guide. Seventh Edition.
Harlow, England: Pearson Education Limited.
Thrane, C. (2018). Do experts’ reviews affect the decision to see motion pictures in
movie theaters? An experimental approach. Applied Economics, 50, 3066–3075.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Second Edition.
New York: Cambridge University Press.
Wasserstein, R. L., Schirm, A. L. and Lazar, N. A. (2019). Moving to a world beyond
“p < 0.05”. American Statistician, 73(S1), 1–19.
Wheelan, C. (2013). Naked Statistics. New York: W.W. Norton & Company.
Wickham, H. and Grolemund, G. (2016). R for Data Science. Sebastopol, CA: O’Reilly
Media, Inc.
Williams, R. (2006). Generalized ordered logit/partial proportional odds models for
ordinal dependent variables. The Stata Journal, 6, 58–82.
Wolf, C. and Best, H. (2015). Linear regression. In Best, H. and C. Wolf: The Sage
Handbook of Regression Analysis and Causal Inference, pp. 55–81. London: Sage
Publications, Inc.
Wooldridge, J. M. (2000). Introductory Econometrics. A Modern Approach. USA: South-
Western Publishing.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
Second Edition. Cambridge, MA: MIT Press.
Index

Gangl, M. 182
Green, D. P. 159
Greene, W. H. 148
Gujarati, D. 32
Halvorsen, R. 111
Hensher, D. A. 148
Hernán, M. A. 159
heteroscedasticity 88–9
homoscedastic residuals 87–9
Hosmer, D. W. 133
Hox, J. 182
Imbens, G. W. 159
independent variable (x) 6, 65; assumptions 81–2; see also dummy variable
index 25–6, 65
inferences 166–8
influential outliers 86
instrument exogeneity 154
instrument relevance 153–4
instrumental variable (IV) regression 150–3, 158–9; assumptions 153–6; dummy variable 156–7; instrument exogeneity 154; instrument relevance 153–4
interaction models 103–5, 127–8
interaction variable 102, 144
intercept see constant
Kenny, D. A. 116
Likert scale 136
linear probability model (LPM) 121–2, 125–6
linear regression 11–18, 31–2; dummy variable 20–2; error term (e) 30; female students’ exercise hours 25–7; footballer pay/performance 22–5; house size/price 11–22; ordinal variables 146–7; Ordinary Least Squares (OLS) 27–30; questionnaires 25–7; slope and constant 19; see also assumptions
linearity assumption 82–3, 105–12
logarithms 108–12
logistic (logit) regression 121; assumptions 133; interaction models 127–8; interpretation 124–7; linear probability model (LPM) 125–6; marginal effects 124–5; mediation models 133; model fit 129–30; multinomial 130–2; non-linearity 128–9; predicted probabilities 126–7; probit model 133; sequential choices 133; standard model 123–4; see also ordinal (logit) regression model
Long, J. S. 133, 136, 148, 180
longitudinal data 31, 180–1
lurking variables 44
Mackenzie, D. 61, 159
marginal effects 124–5, 140–1
Martens, E. P. 159
matching 182
mean confidence interval 72–3
measurement levels 8–10
mechanism see mediation models
median regression 176–7
mediation models 113–16, 133
Meuleman, B. 86
model fit 129–30
Morgan, S. L. 159
motivations see purposes
Muller, C. 158
multi-level regression models 181–2
multicollinearity see perfect multicollinearity
multinomial logistic regression 130–2, 144–5
multiple ordinal (logit) regression model see ordinal (logit) regression model
multiple regression 43–5, 54–6, 61, 65; explanatory power (R2) 56–9; independent dummy variables 50–4; motivations 50; third variable 46–9
negative binomial model 178–80
non-additivity 101–5
non-linearity 105–12, 116, 128–9, 142–4
non-representative samples/populations 78
normal distribution 71–2, 93, 96
normally distributed residuals 90–1
null hypothesis (H0) 74–5, 78, 97
observational data 31
observations see units
omitted variable bias 82
one-sided significance test 78–9
ordinal (logit) regression model 137–8, 147–8; alternative strategies 144–6; interaction effects 144; interpretation 139–41; marginal effects 140–1; non-linearity 142–4; predicted probabilities 139–40; proportional odds-assumption 141–2, 146; sequential choices 146
ordinal variables 136–7, 146–7
Ordinary Least Squares (OLS) 27–30
outcome variable see dependent variable
outliers see influential outliers
Palmquist, R. 111
panel data regression 180–1
path model 113–16
Pearl, J. 61, 159
Pedhazur, E. J. 164
perfect multicollinearity 83–6
Pieters, R. 116
Pischke, J.-S. 156, 159
point-and-click routine: SPSS 38–42; Stata 34–8
Poisson model 178
population 69–70, 79, 97
predicted probabilities 126–7, 139–40
prediction 163–4
predictor variable see independent variable
probability see linear probability model (LPM)
probit model 133, 147
proportional odds-assumption 141–2, 146
purposes 163; causal analysis 168–71; description 164–6; inferences 166–8; prediction 163–4
quadratic regression models 105–7
quantile regression 176–7
questionnaires 25–7
…data regression 180–1; quantile regression 176–7; time-series regression 181
relevant control variables 47–8, 81–2
repeated sampling 71–2
research questions 7, 46–9
residuals 65, 87–92
residuals versus predicted values 88
respondents 6, 66
response variable see dependent variable
Robins, J. M. 159
Rosenbaum, P. R. 159, 182
samples 69–70, 79, 97; random sampling 70–1; repeated sampling 71–2
sequential choices 133, 146
significance levels 76, 78
significance probability (p) 76–7
significance testing 74–9, 97
slope see regression coefficient
Sovey, A. J. 159
SPSS (point-and-click routine) 38–42
standard deviation 12
standard error (SE) 73, 89, 97
Stata 6–7, 11–18, 54, 78; point-and-click routine 34–8
stepwise regression 174
studies see academic regression-based studies
subjects 6, 66
survival analysis 182
Wijngaards-de Meij, L. 182
Winship, C. 159
Wolf, C. 32, 133
Wooldridge, J. M. 28, 31–2, 61, 79, 86, 93, 181–2
x see independent variable
y see dependent variable
zero conditional mean 93