Applied Regression Analysis by Christer Thrane


Applied Regression Analysis

This book is an introduction to regression analysis, focusing on the practicalities of doing regression analysis on real-life data.
Contrary to other textbooks on regression, this book is based on the idea that you do not necessarily need to know much about statistics and mathematics to get a firm grip on regression and perform it to perfection. This non-technical point of departure is complemented by practical examples of real-life data analysis using statistics software such as Stata, R and SPSS.
Parts 1 and 2 of the book cover the basics, such as simple linear regression, multiple linear regression, how to interpret the output from statistics programs, significance testing and the key regression assumptions. Part 3 deals with how to practically handle violations of the classical linear regression assumptions, regression modeling for categorical y-variables and instrumental variable (IV) regression. Part 4 puts the various purposes of, or motivations for, regression into the wider context of writing a scholarly report and points to some extensions to related statistical techniques.
This book is written primarily for those who need to do regression analysis in practice, and not only to understand how this method works in theory. The book's accessible approach is recommended for students from across the social sciences.

Christer Thrane holds a Ph.D. in Sociology and is Professor at Inland Norway University of Applied Sciences. His research interests include quantitative modeling studies in leisure, recreation, sports and tourism. He has 25 years of experience in teaching quantitative research methods and regression analysis.


Applied Regression Analysis


Doing, Interpreting and Reporting

Christer Thrane


First published 2020 by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
52 Vanderbilt Avenue, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2020 Christer Thrane
The right of Christer Thrane to be identified as author of this work has been asserted by
him in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised
in any form or by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying and recording, or in any information
storage or retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered trademarks,
and are used only for identification and explanation without intent to infringe.
British Library Cataloguing-​in-​Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-​in-​Publication Data
A catalog record has been requested for this book
ISBN: 978-​1-​138-​33547-​9 (hbk)
ISBN: 978-​1-​138-​33548-​6 (pbk)
ISBN: 978-​0-​429-​44375-​6 (ebk)
Typeset in Times New Roman
by Newgen Publishing UK
Visit the eResources: www.routledge.com/​9781138335486


Contents

Acknowledgements  ix

PART 1
The basics  1

1 What is regression analysis?  3


1.1 Introduction: Sizes and prices of houses 3
1.2 What this book is about and why it has been written 5
1.3 Some terminology and a word about statistical software 5
1.4 Research questions in the social sciences and economics:
The link to regression analysis 7
1.5 The book’s intended audience, pedagogical approach and
what lies ahead in Part 1 7
Notes 8
Appendix: A short primer on variables and their measurement
levels and how this relates to regression analysis 8

2 Linear regression with a single independent variable  11


2.1 Prices of houses once more 11
2.2 Prices of houses: The regression coefficient (b1) and the
constant (b0) 19
2.3 Prices of houses: An independent variable with two values
(a dummy variable) 20
2.4 From data on houses to data on soccer players:
On performance and pay in football 22
2.5 Data from questionnaires: Female students’ exercise hours 25
2.6 (Optional) calculation of the regression coefficient (b1) and
the constant (b0) by Ordinary Least Squares (OLS) 27
2.7 A quick word about the error term (e) 30
2.8 Summary, a few words on data types and further reading 31
Notes 32

Appendix A: A short introduction to Stata’s point-​and-​click routine 34
Appendix B: A short introduction to the point-​and-​click routine in
SPSS 38

3 Linear regression with several independent variables: Multiple regression  43
3.1 Introduction 1: Prices of houses again 43
3.2 Introduction 2: Social science research questions
and the need to control for a third variable 46
3.3 Two motivations for multiple regression 50
3.4 A special case: Sets of independent dummy variables 50
3.5 A short summary and the regression model’s explanatory
power (R2) 54
3.6 Summary and further reading 61
Notes 61
Exercises with data and solutions 62
Informal glossary to Part 1 64

PART 2
The foundations  67

4 Samples and populations, random variability and testing of statistical significance  69
4.1 Introduction 69
4.2 Populations, samples and representative samples:
The magic of random sampling 70
4.3 The normal distribution and repeated sampling 71
4.4 Confidence intervals for means and regression coefficients 72
4.5 Testing for the significance of a regression coefficient:
Putting the confidence interval on its head 74
4.6 Other critical issues regarding significance testing 77
4.7 Summary and further reading 79
Notes 79

5 The assumptions of regression analysis  81


5.1 Introduction 81
5.2 The assumptions you should think about before doing your
regression and, quite often, before you choose your data (A) 81
5.3 The testable assumptions not requiring a random sample
from a population-​framework (B) 82
5.4 The additional testable assumptions for a random sample
from a population-​framework (C) 87
5.5 Summing up the regression assumptions, some other
issues and further reading 92
Notes 93

Exercises with data and solutions 94
Informal glossary to Part 2 96

PART 3
The extensions  99

6 Beyond linear regression: Non-additivity, non-linearity and mediation  101
6.1 Introduction 101
6.2 Non-​additivity (assumption B1): Interaction/​moderator
regression models 101
6.3 Non-​linearity 1 (assumption B2): Quadratic regression
models 105
6.4 Non-​linearity 2 (assumption B2): Sets of dummy
variables 107
6.5 Non-​linearity 3 (assumption B2): Use of
logarithms 108
6.6 Mediation models 113
6.7 Summary and further reading 116
Notes 116

7 A categorical dependent variable: Logistic (logit) regression and related methods  120
7.1 Introduction 120
7.2 The linear probability model (LPM) 121
7.3 The standard multiple logistic (logit) regression model 123
7.4 Interpreting and presenting the results of the logistic (logit)
regression model 124
7.5 Interaction models, non-​linearity and model fit 127
7.6 Extension: When y has three or more unordered alternatives
(multinomial logistic regression) 130
7.7 Other related issues and further reading 132
Notes 133
Exercises with data and solutions 134

8 An ordered (ordinal) dependent variable: Logistic (logit) regression  136
8.1 Introduction 136
8.2 The multiple ordinal (logit) regression model 137
8.3 Interpreting and presenting the results of the ordinal (logit)
regression model 139
8.4 The proportional odds assumption, non-​linearity and
interaction effects 141
8.5 Alternative modeling strategies 144
8.6 Using linear regression on ordered (ordinal) y-​variables:
Is it justified? 146

8.7 Final thoughts and further reading 147
Notes 148
Exercises with data and solutions 148

9 The quest for a causal effect: Instrumental variable (IV) regression  150
9.1 Introduction 150
9.2 A first take on the relationship between exercising (x1)
and health (y) 151
9.3 The assumptions of IV regression: Instrument relevance
and instrument exogeneity 153
9.4 When x1 and/​or y are dummies 156
9.5 Final thoughts and further reading 158
Notes 159

PART 4
Regression purposes, academic regression projects and the way ahead  161

10 Regression purposes in various academic settings and how to perform them  163
10.1 Introduction 163
10.2 Four purposes of regression analysis: (1) prediction,
(2) description, (3) making inferences beyond the data
and (4) causal analysis 163
10.3 The structure of an academic regression-​based study 171
10.4 Some practical advice 173
Notes 174

11 The way ahead: Related techniques  176


11.1 Introduction 176
11.2 Related technique 1: Median and quantile regression 176
11.3 Related technique 2: Count data regression 177
11.4 Related technique 3: Panel data regression and
time-​series regression (longitudinal data) 180
11.5 Related technique 4: Multi-​level regression models 181
11.6 Related technique 5: Matching 182
11.7 Other, related techniques and further reading 182
Notes 182

References  184
Subject index  188


Acknowledgements

Thanks to Andy Humphries for the opportunity to write this textbook for
Routledge. Thanks also to three reviewers for very valuable suggestions to
the first draft of the book, and to Henning Bueie for initially helping me with
some of the book’s figures.
I published a textbook in Norwegian on regression analysis in 2017. Some
of the material in the present text, especially in Chapters 1 and 2, draws heavily
on that book. Also, some of the examples in the present text are adaptions
and extensions from my 2017 textbook. It goes without saying that all those
contributing to my 2017 textbook also deserve a thank you in regard to the
present book.
Lillehammer, Norway, August 24, 2019
Christer Thrane


Part 1

The basics

In Part 1, “The basics”, you will get to know:

• What regression analysis is, how it works and (hopefully) what it can do
on topics related to your own field of study
• How to do (or as we often say: run) a regression analysis using real data
on your own computer
• How to interpret the regression output from statistical programs regarding
what happens in your data


1 What is regression analysis?

1.1 Introduction: Sizes and prices of houses


Take a look at Figure 1.1. Each dot in the scatterplot is a house that was sold
during 2017 in a small Norwegian city. On the x-​axis (the horizontal axis)
we have the size of the houses in square meters. On the y-​axis (the vertical
axis) we have the sale price of the houses in euros. What does the figure tell
us? Well, several things in fact. First, the scatterplot has a distinct physical
shape: most houses are located either in the upper right corner or in the lower
left corner. More precisely, houses larger than 150 square meters tend to cost
more than 0.4 million €, whereas houses smaller than 150 square meters typ-
ically cost less than 0.4 million €. Generally, we could even say that the larger
the house, the costlier it tends to be –​although there are some exceptions.
Second, the figure shows us a statistical relationship, i.e. how house size is
associated or covaries with sale price. This statistical relationship is posi-
tive: more of x (a larger house) tends to go along with more of y (a pricier
house); less of x (a smaller house) tends to go along with less of y (a less
costly house).
Now take a look at Figure 1.2. The only new feature here is the diagonal
line going from the lower left of the figure and up towards the upper right (in
Stata this line is in red). This line has a practical purpose: start at any point on
the x-​axis, say, 150 square meters. Move straight up to the diagonal line, and
from there go straight to the y-​axis on the left. Here you hit the y-​axis at about
0.5 million €. That is, the average sale price for a 150 square meter house is
0.5 million €. If you want to find the mean price of a larger or smaller house,
repeat the process from another starting point. In plain language the diagonal
line summarizes the typical trend in the statistical relationship between prices
and sizes of houses. In more accurate terms, the diagonal line is called the
regression line. And that is what this book is all about.
Nothing in regard to Figures 1.1 and 1.2 is difficult to grasp, and this sums
up the motivation behind this book. Regression analysis –​which in many ways
boils down to estimating and interpreting red lines like the one in Figure 1.2 –​ is
not complicated. But this begs a question: If regression is “easy” as I claim, why
do we need yet another book on the subject? The answer is coming up shortly.


Figure 1.1 House sale prices by house sizes. 100 houses sold in a small Norwegian city
during 2017.

Figure 1.2 House sale prices by house sizes, with regression line. 100 houses sold in a
small Norwegian city during 2017.


1.2 What this book is about and why it has been written
This book is about the statistical method called regression analysis. Or just
regression. Regression is the work horse among the statistical methods used in
the economic and social sciences. For this reason, there are lots of textbooks
around which, in turn, raises a question: What has this book to offer that
others don’t? The short answer is that this book, contrary to others, is built on
the “wicked” idea that one does not need to know much about mathematics
and statistics to understand regression analysis and to perform it to perfec-
tion. Five more detailed answers are:

1. Most regression textbooks overcomplicate matters and overuse symbols, equations, formulas and math. For those who understand math, this is
not a problem. For those who don’t, and of which there are many more, it
is a severe hindrance to learning. This book thus downplays the technical
features to the greatest extent possible and focuses instead on intuition
and worked examples.
2. Many regression textbooks explain what regression is and how it works in
a satisfactory manner. But most are silent on the practicalities involved in
doing regression analysis, typically alone on your computer. In this book,
doing regression occupies the driver’s seat, whereas the (often too tech-
nical) explanations ride shotgun.
3. Many regression textbooks have the words “applied” or “practical” in
their title. But most of them are neither. Putting a computer printout on
every tenth page or so does not make a book practical or applied. But
letting almost every example go hand in hand with computer outputs
from the analyses of real data, as well as encouraging students to repli-
cate the analyses on their own computers, might do the trick. The latter
approach is taken in this book.
4. Many regression textbooks have narrow perspectives in the sense that
examples are taken from specific themes or fields. This book instead
uses examples that all students and people in general can relate to. Some
include how much or how little people exercise, the variation in prices
for alcoholic beverages, football players’ variation in earnings, to name
but a few.
5. Few, if any, regression textbooks devote serious attention to the proper
interpretation and presentation of regression results –​whether the end
project is a bachelor’s thesis, a master’s thesis or a scholarly journal art-
icle. This book does.

1.3 Some terminology and a word about statistical software


Regression examines if and how a phenomenon or trait is statistically related
to, or associated with, another phenomenon/​trait. But in the method speak
of the social and economic sciences, we do not use the terms “phenomenon”

or “trait” –​we write and speak of variables. The size of houses varies among
houses; it is therefore a variable. The sale price of houses also varies among
houses; it is yet another variable. An obvious rule of thumb: A variable is
something that varies. Also, we make a distinction between independent and
dependent variables. An independent variable is something that possibly
brings about statistical variation in another variable; the size of houses is thus
the independent variable in our example.1 (The notes in the book are placed
at the end of each main chapter.) The dependent variable is the variable we
are interested in explaining, i.e. something that possibly is being affected by
another, independent variable. Thus, the sale price of houses is the dependent
variable in our example. But since dependent and independent variables are
long and heavy names in terms of usage, we often opt for y as shorthand for
the former and x as shorthand for the latter. Figure 1.3 illustrates.
Other popular names for the dependent variable are outcome variable,
response variable, criterion variable and regressand. Similarly, for the inde-
pendent variable, we have explanatory variable, predictor variable, treatment
variable, covariate and regressor.
Units or observations are key concepts when doing (or running) a regres-
sion analysis. The units or observations in our example are houses. More
typically, the units are persons in the economic and social sciences, but they
could be almost anything: firms, countries, products, stocks etc. When people
who have answered a questionnaire are the units, we call them respondents
or subjects. More formally, the units/​observations are those we have data or
variable information about.
To actually do a regression analysis we need two things. First, we need
data, i.e. variable information on a set of units. The data sets used in this
book are available for download from the book’s website. Second, we need
access to a computer with statistical software. In this book I will use Stata

Size of houses (independent variable, x)  →  Sale price of houses (dependent variable, y)

Figure 1.3 Independent and dependent variable.


The arrows symbolize the assumed directions of the statistical relationships/​associ-
ations, i.e. from the independent to the dependent variable.



(www.stata.com) for illustration purposes. In my view, Stata is the statistics
program that best combines a user-​friendly environment with a capability to
tackle any statistical problem. (By the way, it seems I’m not alone in that
assessment given Stata’s rise in popularity among economists and other social
scientists since 2005.) In Part 1 of the book, however, I will also show some
of the analyses using the statistics program SPSS and the cost-​free statistics
program R (www.r-​project.org/​). Note, however, that Excel suffices for doing
the regressions in Part 1 of the book. Beyond that you need sophisticated
statistics software like SAS, SPSS, Limdep, R or Stata.

1.4 Research questions in the social sciences and economics: The link to regression analysis
Quantitative social science and economics is generally about studying statis-
tical relationships (or associations) between an independent and a dependent
variable. These variables, x and y for short, could be almost anything. Our
imagination is the only thing that sets boundaries on what could be x, y and
units/​observations in a given regression analysis. Some examples include:

• The statistical relationship between performance (x) and pay (y) for the
soccer players in the Premier League, if any?
• The statistical relationship between innovation activity (x) and revenue
(y) for 150 firms in the service industry, if any?
• The statistical relationship between the quality of wines (x) and their
price (y) for 275 bottles of red wine, if any?
• The statistical relationship between class size (x) and class-​ averaged
grades (y) for 180 high school classes, if any?
• The statistical relationship between total hours of exercise per week (x)
and self-​assessed health (y) for 350 women aged 40 to 50 years, if any?
• The statistical relationship between number of days at a travel destination
(x) and total trip expenditures (y) for 400 travelers, if any?

Regression fits the bill in all of the above examples. That is, regression analysis
examines if and how an independent variable, x, covaries or correlates with a
dependent variable, y, for a given set of units/​observations.2

1.5 The book's intended audience, pedagogical approach and what lies ahead in Part 1
This book is targeted at students wishing to do regression analysis, and not
just to understand how it works. That said, some rudimentary knowledge of
how to do it must precede the actual doing. In this regard, the book assumes
no prior knowledge in statistics or mathematics. On the contrary, my ambi-
tious pedagogical goal is to explain and illustrate only the mathematical and
statistical concepts needed along the way to understand the situation at hand.3

In doing so, I will eschew symbols, equations and formulas to the greatest
possible extent and instead adopt the example-​driven show-​and-​don’t-​tell
approach.
The remainder of Part 1 of the book is organized as follows. Chapter 2
explains and illustrates so-​called bivariate or simple regression analysis in all
its forms, i.e. a regression with only one independent variable. We start by
more accurately describing the variable relationship mentioned in the opening
of the book. This chapter is the stepping-​stone for the rest of the book.
Chapter 3 extends bivariate regression by incorporating several independent
variables into the regression simultaneously. This is called multiple regression,
and we will cross that bridge when we get to it.

Notes
1 I use the phrase “…something that possibly brings about statistical variation
in another variable” with intention, to emphasize that we are dealing with non-​
deterministic statistical relationships (i.e. associations, correlations, covariations)
and not necessarily causal relationships (i.e. cause-​and-​effect relationships). If not
stated otherwise, the examples in the book are about statistical associations and not
causal relationships. I will say more on using regression analysis to reveal causal
relationships in Chapters 9 and 10 and sections 11.4 and 11.6.
2 In Chapter 3 we will incorporate several independent variables simultaneously in a
regression analysis.
3 For this reason, the book does not start with a crash course in statistics. In my
experience this is a waste of time for (the many) students who fear statistics or
math. For those who do not (the few), it is just redundant. Neither do I devote
a chapter to variables and their measurement levels; nothing about measurement
levels is required knowledge to get a first, firm grip on regression. But if you still
want to know or could do with a recap, the appendix to this chapter offers a
primer on variables and their measurement levels and how this relates to regression
analysis.

Appendix: A short primer on variables and their measurement levels and how this relates to regression analysis
A variable is any information characteristic or trait we can measure/​collect
for a unit or observation, such as persons, houses, cars, stocks, schools, coun-
tries, firms etc. These characteristics or traits vary among units/​observations.
Gender, height, earnings, years of education, exercise hours, health level,
dietary habits, expenditures on food, expenditures on tourism and so on vary
among adults in a given country. Number of employees, number of years
since startup, production processes, the gender of the CEO, size of marketing
budgets, revenue per employee and so on vary among businesses. Price,
quality, mileage, make and color vary among used cars for sale.
A way of grouping variables is according to their measurement level. In this
regard we make a broad and slightly simplified distinction between numerical

(or quantitative) variables on the one hand and categorical (or qualitative)
variables on the other.

Numerical variables The sale price of houses, hours of exercising per week,
annual earnings and size of budget are numerical variables in the sense
that they are interval-​scaled. That is, their values are numbers with specific
distances between them; a house costs for example 100 000 €, 200 000 €,
300 000 € or so on, and the distance between 100 000 € and 200 000 € is the
same as the distance between 300 000 € and 400 000 €. (Note also that 340 000
€ is twice as much as 170 000 €.) In the same vein, someone who exercises for
six hours per week exercises three more hours (or twice as much) as one who
exercises for three hours per week.

Categorical variables Put bluntly, there is no natural (interval) scale for a strict categorical variable; a person is either male or female, a firm has either a female
or a male CEO and a company produces either services or physical products
and sometimes both. That is, there are no low or high values for strict categor-
ical variables; they are unordered with an either/​or logic to them. The variable
gender has two unordered values (male or female); the variable season has
four unordered values (spring, summer, autumn or winter). In statistical ana-
lysis we still assign numbers to these categorical states, such as 1 for woman
and 0 for man or vice versa, 0 to 3 for the four seasons and 1 for yes and 0 for
no. But we do not attach any low or high interpretations to these numbers.

The in-between measurement scale: ordered (ordinal) variables Some variables fall midway between pure numerical and strict categorical. These
variables are ordered/​ordinal. For example, sociologists categorize people into
classes based on their occupations: low class, middle class and upper class;
in questionnaires you get asked about your opinion to a statement in which
the answer categories are strongly disagree, disagree, neither/​nor, agree and
strongly agree. Both variables –​social class and the opinion –​have values or
answer categories in which there is an ordering. That is, upper class is “higher”
than middle class which, in turn, is “higher” than low class; a person agreeing
is “more in agreement” with an opinion than a person disagreeing. Again,
we assign numbers to these categorical states, such as 0 for low class, 1 for
middle class and 2 for upper class and 1 to 5 for the agreement categories. But
although there is an ordering of the answer categories, making it sensible to
speak of “larger” or “smaller” or “higher” or “lower”, there is no well-​defined
distance between them as we saw above for the numerical variables (see also
Chapter 8).
Questionnaires often ask about income using brackets (0–​9 999, 10 000–​
19 999 and so on), or how often one exercises using intervals (never, 1–​2 times
per week, 3–​4 times per week or 5 times per week or more). Note that these
variables in their present form are ordinal, although in real life they are nat-
ural numbers (and hence numerical variables). The distance from, say, “never”

to “1–​2 times per week” is not the same as the distance from “3–​4 times per
week” to “5 times per week or more”. For more on variables and their meas-
urement levels, see for example Agresti and Finlay (2014).

How variables' measurement level links to regression analysis In Chapters 2 to 6 in this book the dependent variable in question, y, is a numerical variable.
That is, Chapters 2 to 6 cover what we normally refer to as linear regression
and its extensions. In Chapter 7, in contrast, y is a strictly categorical variable,
whereas Chapter 8 covers the situation in which y is ordered (ordinal). In
some applications, but not all, non-​numerical y-​variables call for more tailor-​
made regression models than the default linear. We’ll return to the particulars
in this respect when we get there.

2 Linear regression with a single independent variable

2.1 Prices of houses once more


Figure 2.1 is the figure we saw in section 1.1; the diagonal regression line
summarizes how the size of houses (x) is positively related to, or associated
with, the sale price of houses (y) on average. But we also notice a few
exceptions to this general trend. For example, the dotted arrow points to
a house that is rather spacious but also quite cheap in relative terms. The
dashed arrow, in contrast, points to a small house that is relatively costly.
The take-​ home message for quantitative social science and economics,
however, is that we are generally not interested in these anomalies; we are
interested in the typical trend captured by the diagonal line. The main theme
in much statistical analysis, in other words, is to simplify and make sense of
“a cloud of dots”.1 In this regard, regression analysis is unparalleled as a
research tool.
We need to know two things to estimate the diagonal regression line
in Figure 2.1: (1) where on the y-​axis the regression line should start, and
(2) how steep the regression line should be. Given these two pieces of information, it is possible to draw any straight line in an x-y plot. For most practical purposes, however, we
are more interested in the regression line’s steepness than in where it crosses
the y-​axis. In our example we note that the regression line in fact does not
cross the y-​axis at all, which of course is because none of the houses in the
data are zero square meter. (Most of us require just a little more space.)
But we can “mentally” extend the line and easily imagine such a theoretical
crossing point.
As mentioned in Chapter 1, we use data and statistical software to esti-
mate (1) and (2), i.e. the exact location of the regression line. On the book’s
website you find the data used to create Figure 2.1 (housedata.dta),
and these are loaded into Stata in Stata-​output 2.1. Here, I have told Stata
to produce descriptive statistics for the variables sale price of the houses
(price_​Euro) and size of the houses (square_​m). I wrote “sum price_​
Euro square_​m” in the command window at the bottom and pressed Enter



Figure 2.1 House sale prices by house sizes, with regression line. 100 houses sold in a
small Norwegian city during 2017.
Graph generated in Stata.

to get the results in the output. As you see, this command reappears above
the results. The Courier font is used for commands and variable names
throughout the book. For those familiar with a point-​and-​click routine to
do statistical analysis (like in SPSS; cf. Appendix B to this chapter), Stata
also offers this capability; see Appendix A to this chapter or Stata’s YouTube
channel www.youtube.com/ ​ c hannel/ ​ U CVk4G4nEtBS4tLOyHqustDA.
Stata’s internal help menu system also provides a wealth of valuable infor-
mation and resources for the newbie.
Before turning to the practicalities of regression, let’s stop and go through
the preliminaries in Stata-​output 2.1. We have 100 observations (houses) for
both variables; see under the heading “Obs”. The mean or average sale price
is almost 443 000 €, and the mean house size is almost 141 m2; see under the
heading “Mean”. The standard deviation (“Std. Dev.”) –​which we can
think of as the average deviation from the mean –​is about 141 000 for sale
price and 39 for square meter.2 The minimum (“Min”) and maximum (“Max”)
refer to the smallest and largest value of each variable.

Stata-​output 2.1 Descriptive statistics for the variables price_​Euro and square_​m.

The general command for doing a regression in Stata is simply “regress


y x”, where y denotes the dependent variable and x the independent.3 (It
suffices to write “reg” for short.) In the present case the analogous “regress
price_​Euro square_​m” yields the results in Stata-​output 2.2.
The regression line’s steepness is the main issue. The regression coef-
ficient is the measure of interest in this regard, and we find it under the
heading “Coef.”. This coefficient measures the regression line’s steepness,
i.e. what happens to y in the data on average when x increases by one unit.
In this case this translates literally, but not strictly correctly, into: If a house
increases by one square meter, its sale price increases by 2 630 €. Or, more
correctly, since we are comparing different houses at one point in time, that
a 101 square meter house on average costs 2 630 € more than a 100 square
meter house.4 From this we calculate the price difference between a 101
square meter house and a 121 square meter house to be 52 600 € on average
(2 630 × 20).

Stata-output 2.2 Regression of house sale prices (price_Euro) on house sizes (square_m).

Where does the regression line cross the y-​axis? Stata estimates this to be at 73
130 €, i.e. the number below the regression coefficient. In Stata this figure is called
“_​cons”. There are many other numbers of interest in the output, but they need
not concern us now. We return to some of them in Chapters 3 and 4. (The output
from statistics programs generally includes more than we really need!)
Do-​files in Stata. When doing regression (as well as statistical analysis in general),
I always encourage students to write down the commands they use. This is smart
for two reasons. First, you do not need to memorize all commands, i.e. you may
copy and paste a lot. Second, it makes it easier to re-​run a regression a year later to
obtain the exact same results. The outputs from Stata presented so far in this book
are based on the do-​file (i.e. Stata’s name for a command file) in Stata-​output 2.3.

Stata-​output 2.3 Stata do-​file generating the output from Stata in this book so far.

Short explanation of Stata do-​file. Line 1 tells Stata that I am using Stata version
15.0 in case something in the command structure changes in later versions; lines 2 and 3 are useful to include in all do-files for reasons we do not go into. Line 5
starts with //​, which means that Stata neglects what is written on this line. In
other words, such lines are where we write notes to ourselves. Line 6 tells Stata
where to find the data, which on my PC is at “C:\Users\700245\Desktop\
temp_​myd\housedata.dta”. (On your computer this path, of course, will
be different.) Line 9 yields the descriptive statistics in Stata-​output 2.1, whereas
line 12 yields the regression in Stata-​output 2.2. Line 15 tells Stata to generate
(“gen” for short) the variable SP_​mills (sale price in millions), and this vari-
able is then used to create the two graphs in the book (lines 17/​18 and 22/​23). The
sign /​* at the end of the line tells Stata to jump to the next line starting with */​.
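Since the do-file itself appears only as a screenshot (Stata-output 2.3), the following is a minimal sketch of what such a do-file could look like. The housekeeping commands on lines 2 and 3, the file path, the definition of SP_mills and the graph options are my assumptions; only the sum, regress and gen commands are quoted from the description above, and the line numbering will not match the original exactly.

version 15.0
clear all                            // housekeeping (assumed)
set more off                         // housekeeping (assumed)

// Chapter 2: house prices and house sizes
use "housedata.dta", clear           // adjust the path to your own computer

sum price_Euro square_m              // descriptive statistics (Stata-output 2.1)
regress price_Euro square_m          // regression (Stata-output 2.2)

gen SP_mills = price_Euro/1000000    // sale price in millions (assumed definition)
twoway (scatter SP_mills square_m) /*
  */ (lfit SP_mills square_m)        // scatter with fitted line, cf. Figures 2.1 and 2.2

Note how /* at the end of one line and */ at the start of the next let a long graph command continue across two lines, as described above.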
The R program. R is gaining popularity among economists and social scientists.
For this reason and the fact that the program is free of charge, the initial analyses
and results in the book will also be presented in R. The commands will be written
in the more user-​friendly add-​on to R called RStudio (www.rstudio.com/​).
R is a purely command-​driven program, and it thus has some common ground
with writing do-​files in Stata, although the command language is a bit different.
R-​output 2.1 illustrates. Note that the data file name ends with the suffix “.csv”.
This is the preferred (spreadsheet) data format used for R in this book.

R-​output 2.1 R command-​file generating the regression output from Stata-​output 2.2,
descriptive statistics (Stata-​output 2.1) and a regression plot similar to
Figure 2.1.

Short explanation of R command-file. In line 1, # tells R to ignore what is written on this line. Lines 2 and 3 find the data on my computer. Lines 6 and 7 define the variables for the upcoming regression. Line 10 is the regression command (~ is called a tilde), and line 11 tells R to show the results of this
regression at the bottom of the output. The regression coefficient is 2 630
€ (under the heading “Estimate”) and the regression line crosses the y-​
axis at 73 130 € (the number right above the regression coefficient). In R this
figure is called the “Intercept”. Lines 14, 15 and 16 produce the graph
in Figure 2.2. Lines 19 and 20 are the commands for generating descriptive
statistics for the two variables used in the regression analysis.
A related example. We have seen a positive relationship between house size
and sale price; is there also an association between the price of houses and where
they are situated? The housing data contains a variable measuring the distance
from city center to where the house is located (dist_​city_​c). This vari-
able measures the distance in question in kilometers (km), and the descriptive
statistics are presented in Stata-​output 2.4.
Note that the average house is situated slightly more than 3 km from the
city center, whereas the closest one is located only 600 meters (0.6 km) from
it. How is this distance-​variable related to sale price? A regression analysis
answers this question, and the results appear in Stata-​output 2.5.


Figure 2.2 House sale prices by house sizes, with regression line. 100 houses sold in a
small Norwegian city during 2017.
Graph generated in R (see R-​output 2.1).

Stata-​output 2.4 Descriptive statistics for the variable dist_​city_​c.

The regression coefficient is about –25 000 €. This implies that a house located 5 km from the city center on average costs 25 000 € less than a house situated 4 km from the city center (i.e. a static interpretation; cf. note 4). More generally, the further away from the city center a house is located, the cheaper it tends to be, relatively speaking.
Stata do-​file. In terms of obtaining the above results with the Stata do-​file (see
Stata-​output 2.3), the only new code to write is “sum dist_​city_​c” on
one line (descriptive statistics) and “regress price_​Euro dist_​city_​
c” (regression command) on the next line. We’ll skip that.
Stata-​output 2.5 Regression of house sale prices (price_​Euro) on distance to city
center (dist_​city_​c).



Figure 2.3 House sale prices by distance to city center, with regression line. 100 houses
sold in a small Norwegian city during 2017.
Graph generated in Stata.

R command-​file. To get the results above with the R command-​file (see R-​
output 2.1), you need to replace square_​m with dist_​city_​c in line 7
(two times) and lines 10 and 20. We’ll skip that also.
Figure 2.3 is a plot of the above regression and demonstrates the downward-​
sloping regression line.5 (The only new features in the do/​command-​files is
that dist_​city_​c replaces square_​m in the graph commands.)
Two key insights emerge from the analyses and figures above: (1) A positive
regression coefficient implies an upward-​sloping regression line, whereas (2) a
negative regression coefficient means a downward-​sloping regression line.
A positive regression coefficient indicates a positive correlation or association
between x and y; a negative regression coefficient suggests a negative correl-
ation or association between x and y. A regression coefficient of 0, or zero
correlation/association, implies an exactly horizontal regression line, i.e. that x
is not statistically related to/​associated with y.6
If the above analyses and comments were not too hard to grasp, I can assure you that you have now gotten the gist of regression analysis. In section 2.2
we formalize the ideas behind regression a bit more, following up on the rela-
tionship between the size and sale price of houses.


2.2 Prices of houses: The regression coefficient (b1) and the constant (b0)
In Figure 2.1 we saw that the relationship between the price of houses and the
size of houses may be summarized by a straight (linear) regression line. This
regression line, in turn, can be expressed with the simple equation
sale price = where the regression line crosses the y-​axis + regression
coefficient × house size,
or
sale price = 73 130 + 2 630 × house size.
“Where the regression line crosses the y-​axis” is in regression lingo called the
constant or intercept (73 130),7 whereas the “regression coefficient” is called
the slope (2 630).
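As a quick check of this equation (my own illustration, not a calculation from the book), plugging in a house size of 150 square meters gives a predicted sale price of 73 130 + 2 630 × 150 = 467 630 €, which is close to the roughly 0.5 million € we read off the regression line in section 1.1. In Stata, such back-of-the-envelope arithmetic can be done directly with display:

display 73130 + 2630*150    // predicted price of a 150 square meter house: 467 630 €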
To simplify matters in this chapter and for the rest of the book, let’s say
house size = x and sale price = y. (Yes, this is a simplification!) Furthermore,
let’s denote b0 as the constant and b1 as the slope. Then we have the general
equation for the bivariate or simple regression line; that is
y = b0 + b1x.8
Figure 2.4 shows this equation visually. Note that the regression line crosses
the y-​axis at b0 (i.e. for x = 0), and that b1 is the change in y as x increases by
one unit. Since the line is straight (linear), it does not matter where on the
x-​axis we start when doing the “increase by one unit”-​step.


Figure 2.4 The regression line, the constant (b0) and the regression coefficient/​
slope (b1).


2.3 Prices of houses: An independent variable with two values (a dummy variable)
The independent variables considered so far –​house size and distance to city
center –​are both numerical. That is, their values range from small to large
numbers (see the appendix to Chapter 1 if this is unfamiliar). Other inde-
pendent variables, in fact quite a lot, do not have such a measurement scale
(again see the appendix to Chapter 1 if this is unfamiliar); a case in point in
our house data is the variable “renovate”. This categorical variable has
only two values: recently renovated or not recently renovated. In this regard,
recently renovated (or yes) is assigned the value of 1, whereas not renovated
(or no) is given the value of 0.9 The results of the regression between sale price
and this renovation variable are presented in Stata-output 2.6.10 Nothing here is
new in terms of code for Stata do-​files or R command-​files.

Stata-output 2.6 Regression of house sale prices (price_Euro) on whether a house is renovated or not (renovate).

Recall the interpretation of the regression coefficient/​slope/​b1: Change in


y when x increases by one unit. Here, since 0 and 1 are the only values of x
(i.e. renovate), this implies that an “increase” from non-​renovated (0) to
renovated (1) entails a house price increase of 77 300 € in a dynamic sense. Or,
in a more correct and static sense (see note 4), that renovated houses sell for
77 300 € more than non-​renovated houses on average.

The renovation variable has two distinct values, 0 and 1. The general term
for this binary variable is a dummy variable or dummy. Recall the general
regression equation y = b0 + b1x from section 2.2, and assume now that x is
the renovation variable. Then we have (approximately) that

sale price = 406 000 + 77 300 × renovation variable.

When the independent variable is a dummy, the constant/​b0 has a clear inter-
pretation. We see this by plugging the two possible values for x (0 and 1) into
the equation. For non-​renovated houses (x = 0) this yields

sale price = 406 000 + 77 300 × 0 → sale price = 406 000 + 0 →


sale price = 406 000 €.

That is, the constant/​b0 refers to the mean price of houses not renovated. Here,
such houses cost 406 000 € on average.
In general, the constant/​b0 refers to x = 0, i.e. where the regression line
crosses the y-​axis (cf. Figure 2.4). The similar calculation or prediction for
renovated houses (x = 1) yields

sale price = 406 000 + 77 300 × 1 → sale price = 406 000 + 77 300 →
sale price = 483 300 €.

The mean difference in sale price between non-​renovated and renovated


houses is the difference between 483 300 and 406 000 €; that is, 77 300 or b1.11
Other research questions involving a dummy independent variable. Making a
few adjustments to the research questions posed in section 1.4, we could also
examine:

• Whether better-performing soccer players in the Premier League receive more pay than lower-performing players
• Whether innovating firms obtain larger revenues than non-​innovating
firms
• Whether red wines from Burgundy are pricier than red wines from
Bordeaux
• Whether classes with less than 20 pupils outperform classes with more
than 20 pupils in terms of class-​averaged grades
• Whether women aged between 40 and 50 years and exercising for three
hours or more per week report better self-​assessed health than similar
women exercising for less than three hours per week
• Whether travelers staying at a destination for five days or more have larger
trip expenditures than travelers staying for less than five days

As before only our imagination sets the boundaries on what might be the
dependent variable, y, and the independent dummy, x, within a regression
framework. And on that note it is time for some new research questions on
new data.

2.4 From data on houses to data on soccer players: On performance and pay in football
So far our units or observations have been houses. This choice was made
for the reason that data on prices and sizes of houses forcefully illustrate
regression. In practice, however, we more commonly analyze persons or
firms in the social and economic sciences. The data up for analysis now per-
tain to 242 soccer players in the Norwegian version of the Premier League
(PL) from a few years back (football_​earnings.dta). The dependent
variable is yearly earnings in euros (earnings_​E), and the independent
variables are player performance (a numerical variable) and if a player has
appeared in matches for his national team (yes = 1, no = 0; i.e. a dummy).
For each match played every player gets a performance score by an expert
between 1 (worst) and 10 (best). The performance variable is the average of
these scores for the whole season, i.e. the score for each match summed and
divided by total number of matches. Descriptive statistics for the variables
are presented in Stata-​output 2.7.
The mean of yearly earnings is about 118 000 €, and the mean of average
performance is about 4.61 on a scale ranging from 3 to about 6 points. Almost
41% of the 242 players are internationals (see the lower part of the output);
the Stata command “tab int_​player” yields this latter result.
The regression between yearly earnings and player performance is presented
in Stata-​output 2.8. We interpret the regression coefficient in the advisable,
static manner: A player with an average season performance score of 5 points
earns about 63 000 € more on average than a player with an average season
performance score of 4 points.12 Note the negative constant, reflecting that
none of the players of course had an average season performance score of 0
points.
The graphic form of the regression is seen in Figure 2.5. Although there
is a clear positive correlation or association between the two variables, as
indicated by the upward-​sloping regression line, we also note the large spread
(or variation) around this trend. This is the rule rather than the exception
in the application of regression in the social and economic sciences. We are
studying general trends and not set-​in-​stone deterministic relationships with
few or no exceptions.


Stata-output 2.7 Descriptive statistics for the variables earnings_E, performance and int_player.

Stata-output 2.8 Regression of earnings (earnings_E) on player performance (performance).


Figure 2.5 Earnings per year by player performance, with regression line. 242 players
in the Norwegian version of the PL.
Graph generated in Stata.

The regression between earnings and the national team dummy is presented in Stata-output 2.9. The regression coefficient for this dummy is
97 620; players with appearances for their national teams on average make
almost 98 000 € more than players with no national team appearances. The
latter group of players earns a little north of 78 000 € yearly on average, i.e.
the constant.

Stata-output 2.9 Regression of earnings (earnings_E) on national team dummy (int_player).

Taking stock, the two regressions above suggest that performance pays
off. That is, players with better performance and international players (who
presumably play for their national teams owing to better, prior performance)
make more money on average than worse-​performing and non-​international
players.13
In terms of code for do-​files in Stata or command-​files in R, the above
analyses of soccer players’ earnings require nothing that is not covered in
Stata-​output 2.3 or in R-​output 2.1. The only things changing are the variable
names and the name of the data file.
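For readers who want to replicate the soccer analyses, a minimal sketch of the adapted commands might look as follows. The data file and variable names are those given above; the author's own do-file may of course be organized differently.

use "football_earnings.dta", clear
sum earnings_E performance         // descriptive statistics (Stata-output 2.7)
tab int_player                     // share of internationals (about 41%)
regress earnings_E performance     // earnings on performance (Stata-output 2.8)
regress earnings_E int_player      // earnings on the national team dummy (Stata-output 2.9)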

2.5 Data from questionnaires: Female students’ exercise hours


Regressions on data from questionnaires are also common in the economic
and social sciences. The analyses below are based on a survey of 293 female
college students (fem_​stud_​sport.dta). The dependent variable is
hours of weekly exercise (phys_​act_​h), and the independent variables
are membership of a fitness center (yes = 1, no = 0; a dummy) and exercise
interest (minimum interest = 0, maximum interest = 6). The latter variable
is a so-​called index.14 Descriptive statistics for the three variables appear in
Stata-​output 2.10.

Stata-output 2.10 Descriptive statistics for the variables phys_act_h, exercise_int and fitness_c.

The 293 female students on average exercise for about 3 hours per week.
The exercise interest index has a mean score of 4.53 on the scale from 0 to 6,
whereas 43% of the female students are fitness center members (and 57% are
not). The regression between hours of exercise and fitness center member-
ship is shown in Stata-​output 2.11; the analogous regression between hours
of exercise and exercise interest appears in Stata-​output 2.12.
Stata-​output 2.11 says that female members of fitness centers on average
exercise for 2.23 more hours per week than female non-​members (b1), whereas
the constant (b0) implies that female non-​members exercise for a little more
than 2 hours per week on average.

Stata-output 2.11 Regression of hours of exercise per week (phys_act_h) on fitness center membership (fitness_c).

Stata-​output 2.12 shows that female students scoring 5 on the interest index
exercise for 1.37 more hours per week on average than female students scoring
4 on this index (b1). In summary, therefore, fitness center membership and exercise interest are both positively associated with female students' weekly exercise hours.
The above analyses of female students’ exercise hours require nothing that
is not covered in Stata-​output 2.3 or in R-​output 2.1 in terms of code for do-​
files/​command files. Again, the only things that have changed are the variable
names and the name of the data file.

Stata-​output 2.12 Regression of exercise per week (phys_​act_​h) on exercise interest
(exercise_​int).

2.6 (Optional) calculation of the regression coefficient (b1) and the constant (b0) by Ordinary Least Squares (OLS)
How does Stata or any statistics program find the values of b1 and b0? To
show this we use a dataset with few observations. Figure 2.6 presents the
regression between the price of sausages per kilo (in Norwegian Crowns;
NOK) and the sausages’ meat content in percent for 23 types of sausages
offered in Norwegian grocery stores. The data are called sausages.dta.
The figure has the same form as our regressions up until now, and it unsur-
prisingly suggests that sausages with higher meat content tend to be pricier
on average.
Figure 2.6 shows how certain data points (sausages) are located near the
regression line, whereas others are located further away from it. One observation lies exactly on the regression line, and another lies very nearly on it. The vertical
distance from each data point to the regression line expresses how wrong we
are when predicting a price from the regression line rather than from the exact
data point. This distance is called the residual. By squaring these residuals,
i.e. by treating positive and negative residuals in the same manner, and then
adding them together, we get the sum of squared residuals. That done, the
take-​home message of Figure 2.6 is that the diagonal line is the line (among all


Figure 2.6 Price of sausages by meat content in percent, with regression line. 23 sausages available in Norwegian grocery stores.
Graph generated in Stata.

possible lines) that minimizes this sum of squared residuals (hence the name
“Ordinary Least Squares”). Wooldridge (2000; see section 2.8) covers the
particulars in this respect.
To get the actual estimate of b1, we apply the formula (where ∑ is a
summation sign)

b1 = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)²,

for which the necessary numbers (i.e. data) are to be found in Table 2.1.
Here, y refers to the price of sausages and x to the sausages’ meat content.
This yields

b1 = 8381.86 / 2357.43 ≈ 3.555.

Table 2.1 Data for the regression presented in Figure 2.6 and the estimates in Stata-output 2.13.

Obs.   y   x   (x − x̄)   (y − ȳ)   (x − x̄)(y − ȳ)   (x − x̄)²

1 177 75 –​1.61 13.57 –​21.85 2.59


2 154 86 9.39 –​9.43 –​88.55 88.17
3 210 80 3.39 46.57 157.87 11.49
4 148 70 –​6.61 –​15.43 101.99 43.69
5 149 71 –​5.61 –​14.43 80.95 31.47
6 303 90 13.39 139.57 1868.84 179.29
7 149 64 –​12.61 –​14.43 181.96 159.01
8 83 54 –​22.61 –​80.43 1818.52 511.21
9 111 80 3.39 –​52.43 –​177.74 11.49
10 119 72 –​4.61 –​44.43 204.82 21.25
11 219 79 2.39 55.57 132.81 5.71
12 207 91 14.39 43.57 626.97 207.07
13 146 79 2.39 –​17.43 –​41.66 5.71
14 163 92 15.39 –​0.43 –​6.62 236.85
15 133 74 –​2.61 –​30.43 79.42 6.81
16 179 73 –​3.61 15.57 –​56.21 13.03
17 137 70 –​6.61 –​26.43 174.7 43.69
18 149 75 –​1.61 –​14.43 23.23 2.59
19 282 85 8.39 118.57 994.8 70.39
20 213 81 4.39 49.57 217.61 19.27
21 82 68 –​8.61 –​81.43 701.11 74.13
22 83 59 –​17.61 –​80.43 1416.37 310.11
23 163 94 17.39 –​0.43 –​7.48 302.41
Mean 163.43 76.61
Sum 8381.86 2357.43

Once b1 is found, we apply the formula (where “¯” on top of x and y means
the mean of x and y)

b0 = ȳ − b1x̄

to get b0. This yields

b0 = 163.43 − (3.555 × 76.61) ≈ −108.9.

Stata-​output 2.13 presents the regression and verifies that our calculations are
correct. But relax, you normally don’t need to do these calculations yourself –​
that’s what statistical software is for!
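Still, if you would like Stata to do the legwork of Table 2.1 for you, a sketch along the following lines reproduces the two formulas directly. The variable names are taken from Stata-output 2.13; the particular use of r(mean) and r(sum) is my own choice, not necessarily how the book's do-files are written.

use "sausages.dta", clear
quietly sum meat_content
gen x_dev = meat_content - r(mean)     // x minus its mean
quietly sum saus_price
gen y_dev = saus_price - r(mean)       // y minus its mean
gen xy_dev = x_dev*y_dev               // (x - xbar)(y - ybar)
gen x_dev2 = x_dev^2                   // (x - xbar) squared
quietly sum xy_dev
scalar num = r(sum)                    // about 8381.86
quietly sum x_dev2
scalar den = r(sum)                    // about 2357.43
display "b1 = " num/den                // about 3.555
quietly sum saus_price
scalar ybar = r(mean)
quietly sum meat_content
scalar xbar = r(mean)
display "b0 = " ybar - (num/den)*xbar  // about -108.9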

Stata-​output 2.13 Price of sausages (saus_​price) by meat content (meat_​
content).

2.7 A quick word about the error term (e)


Thinking about the variation in prices, earnings and exercise hours, both
intuition and the preceding analyses tell us that several independent
variables might bring about variation in these variables (in an associative,
and not necessarily causal, sense). For houses we saw that size, distance to
city center and renovation status affected prices. Also, performance and
international appearance affected soccer players’ earnings, whereas fitness
center membership and exercise interest entailed variation in students’
exercise hours. Yet we only had one independent variable in each of the
regressions in this chapter. In such cases the effects of all other variables on
the dependent variable are captured by the so-​called error term (e) or dis-
turbance term. That is, the complete bivariate or simple regression equation
should be written as

y = b0 + b1x + e.

This e is an unobserved variable, and we assume that its mean value is 0. For
this reason, we may disregard e in practice (but see section 5.4). Yet the e
reminds us of the fact that many independent variables might in real life affect
our dependent variable. We could even say that multiple regression, the topic
of Chapter 3, to some extent boils down to extracting independent variables
from e and putting them into the actual regression. In this sense, multiple
regression is among other things about reducing the size of the error term (e)
as compared with bivariate regression.
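A quick way to see this property in practice: as long as the regression includes a constant, the estimated residuals always average out to (numerically) zero, which mirrors the assumption about e. A tiny R sketch with simulated data (all numbers are made up for illustration):

set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)    # simulate y with a known error term
fit <- lm(y ~ x)
mean(residuals(fit))          # essentially zero (only floating-point noise)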


2.8 Summary, a few words on data types and further reading


Summary. This chapter has been a quick-​and-​dirty introduction to regres-
sion. The aim of bivariate or simple regression is to describe how variation
in an independent variable, x, is statistically related to, or associated with,
variation in a dependent variable, y. Two figures are of key importance in this
regard: the regression coefficient/​b1 and the constant/​b0. The steepness of the
regression line is measured by b1. If b1 is positive, we have an upward-​sloping
regression line; if b1 is negative, the line is downward-​sloping. The constant/​
b0 refers to where the regression line crosses the y-​axis (at x = 0), but it has a
real-​life interpretation only when x = 0 is an actual value in the data.15
It is a bit discouraging to end this chapter with the recognition that
bivariate regression for all intents and purposes is only a pedagogical means
to set the scene for what this book really is about, namely multiple regression.
Yet multiple regression is straightforward if you have understood the basics
of this chapter. In fact, as a practical matter, it just involves telling Stata or R
or some other statistics software that your regression also should include this
or that other independent variable. In terms of interpretation, however, some
new features come to light, and we will return to them in Chapter 3.
A few words on data types. Regression is mainly carried out on two types of
data in the social and economic sciences: cross-​sectional data and longitu-
dinal data. Cross-​sectional regression analysis is about analyzing a number
of units or observations at one specific point in time, i.e. we cut across the
time-​axis. Our regressions on houses, soccer players and female students in
this book are therefore cross-​sectional. Longitudinal regression, by contrast,
follows a number of units (e.g. persons, firms, stock options etc.) over a period
of time (e.g. days, weeks, years) and sometimes also includes time itself as an
independent variable. I will not cover regression analysis for longitudinal data
in depth in this book, but see section 11.4.
Another distinction is between observational and experimental data. Most
of the data in this book are observational, this being the norm in the eco-
nomic and social sciences. The difference between observational and experi-
mental data is mostly relevant for the possibility of distinguishing between
statistical associations (for observational data) and causal relationships (for
experimental data). I will return to the topic of causal analysis (i.e. cause-​and-​
effect relationships) in Chapters 9 and 10 and sections 11.4 and 11.6.

Further reading
There are many textbooks and research articles on regression. These have
different strengths and weaknesses, depending on the readers’ background,
skills and motivation. My three personal favorites are Allison (1999), Berk
(2004) and Wooldridge (2000 or later editions). Especially the latter covers just
about everything one needs to know about regression analysis.16 Chapter 5 in
Allison (1999), Chapters 2 and 3 in Berk (2004) and Chapter 2 in Wooldridge
are especially relevant for the present chapter.
Other important texts, especially tailor-​made for economists but very read-
able to others as well, are Dougherty (2011) and Gujarati (2015).
Wolf and Best (2015) provide a compact yet very thorough introduction to
regression for the social sciences in general. In the chapters to come I will add
more books and papers to this list.

Notes
1 In essence, when doing statistical analysis, we throw away lots of information on
individual units in order to get a clearer picture of the general trends among the
most typical units. That is, we gain more insight by throwing away information (see
Stigler, 2016)!
2 I said in sections 1.2 and 1.5 that this book presupposes no skills in mathematics/​
statistics. This still applies, but you should know how to do basic calculations from
a set of numbers. If you do not, here is how: The data (numbers) 7, 8, 8, 8, 9, 9, 9,
9, 10, 10, 10, 11 and 11 are the shoe sizes (a numerical variable) of 13 adult males
(the units). To obtain the mean shoe size for these 13 men, you add the 13 numbers
together and divide this sum by 13. This yields 9.15 after rounding. The typical
spread or variation around this mean is measured by the standard deviation. To
obtain the standard deviation, you first find the variance (s2) using the formula

s² = Σ(yi − ȳ)² / (n − 1),

where Σ is the summation sign, yi are the individual shoe sizes, ȳ is the mean shoe
size (i.e. 9.15) and n is the number of units (i.e. 13). The standard deviation is then

s = √s²,
which becomes 1.21 after rounding. You should calculate this on your own as an
exercise!
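In R, this small exercise can be checked with a few lines (the vector name is illustrative):

shoe <- c(7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11)   # the 13 shoe sizes
mean(shoe)                                        # about 9.15
sum((shoe - mean(shoe))^2) / (length(shoe) - 1)   # the variance, s2
sd(shoe)                                          # the standard deviation, about 1.21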
3 Other statistics programs have similar commands and/​or point-​and-​click routines.
The point-​and-​click routines for Stata and SPSS are shown in Appendices A and B
in this chapter.
4 The first interpretation is dynamic; the second is static. Since we are comparing
different houses at one point in time, a static interpretation makes most sense. We
are not examining what happens to sale prices after an increase in house sizes, which
calls for a dynamic interpretation. Normally we restrict dynamic interpretations to
causal relationships; more on this in Chapters 9 and 10 and sections 11.4 and 11.6.
That said, dynamic interpretations are often more elegant in terms of expressing
how x appears to affect y in the data.
5 It should be mentioned that the results of regressions, especially for data with sev-
eral hundreds or thousands of observations, seldom are presented in a graphic
fashion. In this sense the many graphic displays of regression results in this book
serve mostly a pedagogical purpose.

6 That is, at least not in a straight or linear way. Non-​linear statistical relationships
between x and y are covered in Chapter 6.
7 This is why Stata calls it “_​cons”.
8 The strictly correct equation, which we will return to in section 2.7, is y = b0 + b1x
+ e, where e is the so-​called error term. For now, we just assume that e = 0.
9 It is customary to denote 1 as presence of something or yes, and 0 as non-​presence
of something or no. But the research question at hand may trump this. Always
make sure variables with two values or categories are coded 1 and 0 in a regression
analysis (and not 1 and 2).
10 Exactly 47% of the houses in the data have been renovated; the descriptive
Stata command “tab renovate” tells you this. In R the similar command is
“table(renovate)”.
11 Since an independent dummy only has two values, it does not yield a graph as nice
as those for numerical independent variables (e.g. Figure 2.1). For this reason, the
results for dummies are seldom presented graphically.
12 Remember: We are comparing different soccer players at one point in time; we
are not analyzing how better or worse performance affect (future) earnings; cf.
note 4.
13 If this book was an introduction to quantitative methods in the social sciences,
we would say that player performance is a theoretical construct and that “average
season performance on a 1 to 10 scale” and “national team appearances” are
operationalizations of this construct. Although the topic of operationalization
is important in quantitative social science, we will not cover it explicitly in
this book.
14 An index is a combination of two or more variables, most typically the sum or
average of these. The exercise interest index is based on two statements (ordinal
variables) with the answer categories apply very poorly (0), apply poorly (1),
apply rather poorly (2), neither poorly nor well (3), apply rather well (4), apply well
(5) and apply very well (6). The first statement reads “Doing physical exercise is
important for me” and the second reads “I am not interested in doing physical
exercise”. The latter statement’s answer categories were reversed (6 = 0, 5 = 1,
4 = 2, 3 = 3, 2 = 4, 1 = 5 and 0 = 6) before the two variables were summed and
divided by two to keep the original scale from 0 to 6. A female student scoring
6 on the index (maximum interest) thus means that she answered apply very well
to the first statement and apply very poorly to the second, which in turn yields
12/​2 = 6. This index is an operationalization of the theoretical construct exer-
cise interest (cf. note 13). For practical purposes we treat this index and similar
ones as numerical variables in the social sciences, although this is not strictly
correct.
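A sketch of how such an index could be constructed in R, assuming the two statements are stored as the variables item1 and item2 on the 0 to 6 scale; the variable names and the three example answers below are hypothetical, not taken from the actual data:

# Hypothetical answers from three students on the two statements (0-6 scale)
item1 <- c(6, 3, 0)    # "Doing physical exercise is important for me"
item2 <- c(0, 3, 6)    # "I am not interested in doing physical exercise"

item2_rev <- 6 - item2                    # reverse the scale: 6 = 0, 5 = 1, ..., 0 = 6
exercise_int <- (item1 + item2_rev) / 2   # sum and divide by two: original 0-6 scale
exercise_int                              # 6, 3, 0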
15 But if you are to make predictions from your regression analysis (cf. section 2.3),
you should always make sure to include the constant. Otherwise your predictions
will be wrong!
16 The first edition of Wooldridge’s textbook came out in the year 2000, and sev-
eral editions have been published after this. Technically this book is on econo-
metrics, which is economists’ umbrella name for applying statistics/​mathematics
to economic data. In practice, however, this often boils down to doing regression
analysis.


Appendix A: A short introduction to Stata’s point-​and-​click routine


To obtain the descriptive statistics in Stata-​output 2.1, click on “Statistics” and
go to “Summaries, tables, and tests” → “Summary and descriptive statistics”
→ “Summary statistics” like in Stata-​output 2.14. The box in Stata-​output
2.15 will then appear.

Stata-output 2.14 Point-and-click routine in Stata: Step 1 for obtaining descriptive statistics.

Stata-​output 2.15 The box appearing after following the procedure in Stata-​output 2.14.

Then, open/​click on the blank space for “Variables:” (see arrow) and click/​
select the variables square_​m and price_​Euro to produce Stata-​output
2.16. Then click on OK and Stata-​output 2.17 appears.

Stata-​output 2.16 Selecting the variables square_​m and price_​Euro.

Stata-​output 2.17 Results of the procedure from Stata-​output 2.14 to 2.16.

To obtain the regression results in Stata-output 2.2, click on “Linear models and
related” → “Linear regression” like in Stata-output 2.18. The box in Stata-output
2.19 will then appear.

Stata-​output 2.18 Point-​and-​click routine in Stata: Step 1 for doing linear regression.

Stata-​output 2.19 The box appearing after following the procedure in Stata-​output 2.18.

Then, click on/select the dependent variable (price_Euro; dashed arrow) and
independent variable (square_m; dotted arrow) to produce Stata-output 2.20.
Then click on OK and Stata-output 2.21 appears.

Stata-output 2.20 Selecting the variables square_m (independent) and price_Euro (dependent).

Stata-​output 2.21 Results of the procedure from Stata-​output 2.18 to 2.20.

Appendix B: A short introduction to the point-and-click routine in SPSS
To obtain the descriptive statistics in Stata-​output 2.1, click on “Analyze” and
go to “Descriptive statistics” → “Descriptives” like in SPSS-​output 2.1. The
box in SPSS-​output 2.2 will then appear.

SPSS-output 2.1 Point-and-click routine in SPSS: Step 1 for obtaining descriptive statistics.

SPSS-​output 2.2 The box appearing after following the procedure in SPSS-​output 2.1.

Then, click on/​select square_​m and price_​Euro to get them into the
(active) window on the right-​hand side (→), like in SPSS-​output 2.3. Then
click on OK and SPSS-​output 2.4 appears.

SPSS-​output 2.3 Selecting the variables square_​m and price_​Euro.

SPSS-​output 2.4 Results of the procedure from SPSS-​output 2.1 to 2.3.

To obtain the regression results in Stata-output 2.2, click on “Analyze” and go to
“Regression” → “Linear” like in SPSS-output 2.5. The box in SPSS-output 2.6
will then appear.

SPSS-​output 2.5 Point-​and-​click routine in SPSS: Step 1 for doing linear regression.


SPSS-​output 2.6 The box appearing after following the procedure in SPSS-​output 2.5.

Then, click on/select the dependent variable (price_Euro; dashed arrow) and
independent variable (square_m; dotted arrow); see SPSS-output 2.7. Then click
on OK and SPSS-output 2.8 appears.

SPSS-output 2.7 Selecting the variables square_m (independent) and price_Euro (dependent).

SPSS-​output 2.8 Results of the procedure from SPSS-​output 2.5 to 2.7.


3 Linear regression with several independent variables
Multiple regression

3.1 Introduction 1: Prices of houses again


We return to the data on houses and their sale prices. Based on intuition
and common sense, we expect that number of bedrooms has a positive effect
on sale price; i.e. that more bedrooms yield pricier houses. Stata-​output 3.1
examines this expectation. (The mean of bedrooms is 3.41; not shown.) We’ll
skip the full Stata screenshots from here on and display only the commands
and results of the analyses.

Stata-output 3.1 Regression of house sale prices (price_Euro) on number of bedrooms (bedrooms).

. use "C:\Users\700245\Desktop\temp_myd\housedata.dta"

. regress price_Euro bedrooms

Source | SS df MS Number of obs = 100


-------------+---------------------------------- F(1, 98) = 22.76
Model | 3.7086e+11 1 3.7086e+11 Prob > F = 0.0000
Residual | 1.5965e+12 98 1.6291e+10 R-squared = 0.1885
-------------+---------------------------------- Adj R-squared = 0.1802
Total | 1.9674e+12 99 1.9872e+10 Root MSE = 1.3e+05

------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms | 61456.68 12880.64 4.77 0.000 35895.47 87017.89
_cons | 233101 45739.87 5.10 0.000 142331.8 323870.3
------------------------------------------------------------------------------

The regression coefficient for bedrooms suggests that a house with three
bedrooms on average costs about 61 500 € more than a house with two
bedrooms. This is much as anticipated. In other words, our regression gives
rise to the visual model depicted in Figure 3.1.
Let’s for a second examine what happens if we also include the independent
variable square meter (square_​m) in the regression, leaving the further tech-
nicalities of this aside for the moment. Stata-​output 3.2 presents the results.
Note that the Stata command simply adds “square_​m” to the previous
regression command in Stata-​output 3.1.



Figure 3.1 Visual model describing how number of bedrooms affects house sale price
positively, based on the regression in Stata-output 3.1.

Stata-output 3.2 Regression of house sale prices (price_Euro) on number of bedrooms (bedrooms) and square meter (square_m).

. regress price_Euro bedrooms square_m

Source | SS df MS Number of obs = 100


-------------+---------------------------------- F(2, 97) = 53.48
Model | 1.0317e+12 2 5.1585e+11 Prob > F = 0.0000
Residual | 9.3565e+11 97 9.6459e+09 R-squared = 0.5244
-------------+---------------------------------- Adj R-squared = 0.5146
Total | 1.9674e+12 99 1.9872e+10 Root MSE = 98214

------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms | 3897.93 12107.64 0.32 0.748 -20132.38 27928.24
square_m | 2572.144 310.7542 8.28 0.000 1955.382 3188.905
_cons | 67913.05 40460.62 1.68 0.096 -12390.07 148216.2
------------------------------------------------------------------------------

What happens to the coefficient for bedrooms here? It is no longer 61 500 €,
but only 3 900 €. Whereas the first regression suggests that number of bedrooms
has a huge effect on sale prices, the second tells us that this effect is miniscule.
How come? Short answer (to be expanded later): The first regression compares
prices of houses with unequal numbers of bedrooms (1, 2, 3…) without taking
the size of the houses (i.e. square_​m) into account. The second regression, in
contrast, makes the similar comparison while explicitly taking the size of the
houses into account. That is, the latter comparison is made for houses of equal
size. In essence this is what multiple regression –​i.e. a regression with two or
more independent variables –​does! More on this in section 3.2.
Figure 3.2 captures what happens. The square meter variable has a positive
association with both number of bedrooms and sale price, i.e. the two arrows.
That is, the square meter variable literally makes room for more bedrooms, and
it is also the main driver of sale price. But there is no association, i.e. no arrow,
between number of bedrooms and sale price; a coefficient of 3 900 € is virtually
zero in this regard. The association between number of bedrooms and sale price
is, in other words, false. Or as we say in statistics lingo: spurious. Variables like
square meter in Figure 3.2 are called lurking variables in the social sciences; they
lurk in the background and “explain away” the bivariate association between
two variables (like in Stata-​output 3.1 and Figure 3.1). A formal name for this
type of lurking variable is a confounder. More on this in the next section.
Multiple regression in R. R-​output 3.1 shows how to do a multiple regression
in R, using the same data and variables as in Stata-​output 3.2.


Figure 3.2 Visual model showing how the variable square meter affects number of
bedrooms and sale price positively, based on the regressions in Stata-​
outputs 3.1 and 3.2.

R-​output 3.1 R command-​file generating the statistical output of Stata-​output 3.2.
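The command-file itself appears as a screenshot in the printed book and is not reproduced here. A minimal R sketch that would produce the same regression might look like the following; the haven package and the file path are illustrative assumptions, not the book's own code:

library(haven)                        # for reading Stata .dta files into R
house <- read_dta("housedata.dta")    # illustrative path to the house data
fit <- lm(price_Euro ~ bedrooms + square_m, data = house)
summary(fit)                          # coefficients, R-squared and so on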


3.2 Introduction 2: Social science research questions and the need to control for a third variable
Section 3.1 exemplified the need to control for a third variable to find out if
and how an independent variable, x, is associated with a dependent variable,
y. For observational data, the norm in the social sciences and hence the typ-
ical data in this book, this is a very common research quest.1 The reason is
that we do not want to claim that x affects y when the fact in real life is that
a third variable, z, is responsible for the false or spurious association between
x and y –​as we saw in section 3.1.2 This problem comes in many shapes and
guises in the social sciences, and multiple regression is typically the preferred
remedy. Here is another and by now familiar take on the problem.
Let’s say someone via questionnaire data and a bivariate regression finds
out that members of fitness centers on average exercise for about two more
hours per week than non-​members (cf. section 2.5). Imagine a journalist
writing a piece about this using the heading “Join a gym to get more exer-
cise!”, implying that joining a fitness club by itself makes people more phys-
ically active. Using the data on female students’ exercise habits, we saw that
fitness center members exercised more hours than non-​members per week
(Stata-​output 2.11). This regression is repeated in the upper part of Stata-​
output 3.3. Note that female fitness center members on average exercise for
2.23 more hours per week than female non-​members (dashed arrow).
The lower part of Stata-​output 3.3 is a multiple regression also including
exercise interest as an independent variable. In this regression the coefficient
for fitness center membership is reduced to 0.92 hours (dotted arrow). We
note the same pattern as in section 3.1, i.e. a large bivariate regression coeffi-
cient and a much smaller regression coefficient in the multiple regression case.
Results in R. Nothing in Stata-​output 3.3 is new compared with R-​output 3.1
in terms of code, except for variable names and a new data file.
What happens here? The bivariate regression compares fitness center members
and non-​members in terms of exercise hours with no regard to other variables
possibly bringing about variation in these hours, such as exercise interest, fitness
motivation, age, working hours etc. The multiple regression, in contrast, does
the analogous comparison between fitness center members and non-​members
among students equally interested in exercising at the outset. And among those
equally interested in exercising at the outset, the difference in exercise hours
between fitness center members and non-​members is only 0.92 hours per week.
In this example, unlike for number of bedrooms in section 3.1, the bivariate
effect of fitness center membership on exercise hours does not vanish
altogether in the multiple regression case. But it gets reduced in size by 59%
((2.23–​0.92)/​2.23 = 0.587). In other words, 59% of the bivariate relationship
between fitness center membership and hours of exercise is spurious. And the
lurking variable or confounder responsible for this effect reduction is the vari-
able exercise interest, which by itself also has a positive effect (b = 1.23) on
hours of exercise per week.

Stata-​output 3.3 Regression of hours of exercise per week (phys_​act_​h) on fitness
center membership (fitness_​ c) (upper part) and regression of
hours of exercise per week (phys_​act_​h) on fitness center member-
ship (fitness_​c) and exercise interest (exercise_​int) (lower part).

. use "C:\Users\700245\Desktop\temp_myd\fem_stud_sport.dta"

. reg phys_act_h fitness_c

Source | SS df MS Number of obs = 293


-------------+---------------------------------- F(1, 291) = 36.30
Model | 355.742219 1 355.742219 Prob > F = 0.0000
Residual | 2852.17502 291 9.80128871 R-squared = 0.1109
-------------+---------------------------------- Adj R-squared = 0.1078
Total | 3207.91724 292 10.9860179 Root MSE = 3.1307

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fitness_c | 2.225656 .3694298 6.02 0.000 1.498563 2.952749
_cons | 2.08982 .2422609 8.63 0.000 1.613015 2.566626
------------------------------------------------------------------------------

. reg phys_act_h fitness_c exercise_int

Source | SS df MS Number of obs = 293


-------------+---------------------------------- F(2, 290) = 64.77
Model | 990.496971 2 495.248485 Prob > F = 0.0000
Residual | 2217.42026 290 7.64627677 R-squared = 0.3088
-------------+---------------------------------- Adj R-squared = 0.3040
Total | 3207.91724 292 10.9860179 Root MSE = 2.7652

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fitness_c | .9233283 .3562325 2.59 0.010 .2221994 1.624457
exercise_int | 1.232009 .1352184 9.11 0.000 .9658751 1.498143
_cons | -2.934121 .5914624 -4.96 0.000 -4.098224 -1.770017
------------------------------------------------------------------------------

There are several ways of expressing the findings of this multiple regres-
sion: We may say that fitness center members exercise almost one hour (0.92)
more than non-​members, controlling for exercise interest.3 Or we could say the
same, but finish off with holding exercise interest constant. A third way is to
use the ceteris paribus (i.e. all-​else-​equal) clause, meaning that all other inde-
pendent variables included in the regression are being held constant or being
controlled for. In the general multiple regression case of y on x1, x2, x3 and
so on, the regression coefficients for x1, x2, x3 and so on are always controlled
for each other in this way. But when we said above that the multiple regres-
sion compared fitness center members and non-​members equally interested in
exercising, this should be taken a bit metaphorically. In more precise math-
ematical terms, which we will not dive into (see Fox, 1997 or Wooldridge,
2000), multiple regression partitions or isolates the effects of the independent
variables into the unique contributions from each independent variable in the
regression (see also section 3.5).
In this example we controlled the fitness center membership dummy of
key interest (x1) for exercise interest (x2). In practice we typically control
for several independent variables simultaneously (and not only x2), making

comparisons between even more homogenous groups: fitness center members
and non-​members who are equal with respect to, say, age, dietary habits, body
weight, fitness motivation etc. The umbrella term for such variables is relevant
control variables, i.e. variables we (a) believe affect y and (b) believe to be
correlated with the independent variable of key interest, x1. But even though
using several controls makes multiple regression messier in mathematical
terms, it is by no means more complex to carry out in statistics programs such
as Stata, R, SPSS and the like.
Now for an important thought experiment. Imagine adding two more con-
trol variables, say age and fitness motivation, to the multiple regression at the
bottom of Stata-​output 3.3. (Both are likely to affect exercise hours and to be
correlated with the membership dummy.) Two things might then happen to the
regression coefficient (effect) of fitness center membership: (1) It might remain
at around 0.9 hours. If so, we may add these controls to the ceteris paribus
clause and become more certain that fitness center membership has a unique
effect on hours of exercising, as the journalist claims. (2) The regression coeffi-
cient (effect) might get much smaller and in the extreme case become zero.4 If
so, we can no longer claim that fitness center membership has a unique effect
on exercise hours. The journalist’s claim would, in short, be false. Implication
1: If we want to be certain our independent variable of key interest (here: fitness
center membership) has a larger-​than-​zero effect on the dependent variable in
question (here: hours of exercise), we must always make every possible effort
to include all relevant control variables in our multiple regression.5 Implication
2: It is quite hard to actually become dead sure that our data comprise all rele-
vant control variables in regression-​situations like this…
Formally, the multiple regression equation may be written as

y = b0 + b1x1 + b2x2 + … + bnxn + e,

where subscript n denotes that we can have as many independent variables/
regression coefficients as we want provided enough observations.6
Other examples. Examples of social science research questions in which
one has to look out for lurking variables (confounders) abound. Below you
find once more, but now with a small twist (in italics), some of the research
questions posed in section 2.3:

• Whether innovating firms really obtain larger revenues than non-innovating firms
Comment: The two groups of firms may also differ with respect to
other independent variables than innovation capacity influencing revenue
size (e.g. years since startup, number of employees, size of marketing

budget etc.) Before concluding that innovation has this or that unique
effect on revenues based on a bivariate regression, these other inde-
pendent variables should be controlled for in a multiple regression.
• Whether classes with less than 20 pupils really outperform classes with
more than 20 pupils in terms of class-​averaged grades
Comment: The two groups of classes may also differ with respect to
other independent variables than number of pupils influencing grades
(e.g. pupil skills, pupil motivation, school vicinity affluence etc.) Before
concluding that number of pupils has this or that unique effect on grades
based on a bivariate regression, these other independent variables should
be controlled for in a multiple regression.
• Whether women aged between 40 to 50 years and exercising for three
hours or more per week really report better self-​assessed health than
similar women exercising for less than three hours per week
Comment: The two groups of women may also differ with respect to
other independent variables than exercise habits affecting health (e.g.
smoking prevalence, dietary habits, educational level etc.) Before con-
cluding that exercise has this or that unique effect on health based on
a bivariate regression, these other independent variables should be con-
trolled for in a multiple regression.
• Whether better-​performing players in the Premier League really receive
more pay than lower-​performing players
Comment: The two groups of players may also differ with respect to
other independent variables than performance affecting pay (e.g. age,
player position, number of matches played, goals scored in career, player
nationality, number of assists etc.) Before concluding that performance
has this or that unique effect on pay based on a bivariate regression,
these other independent variables should be controlled for in a multiple
regression.7

A final example. Among a group of people unemployed for some time, we
later find that those who participated in a voluntary work training program
were subsequently (a) more likely to get a job and (b) paid higher earnings
than those who chose not to participate in the program. May we attribute the
differences in (a) and (b) to the training program itself ? Short answer: No!
Longer answer: Since the training program is voluntary, those who chose to
participate might be different than those who did not on many other accounts
that may affect (a) and (b), such as skills, motivation, prior working experi-
ence, age, educational level etc. Thus, in order to get the unique or ceteris
paribus effect of the training program itself, these variables must be included
in the multiple regression as controls.


3.3 Two motivations for multiple regression


Motivation 1. The insights from sections 3.1 and 3.2 give the first motivation
for multiple regression: to find the unique or ceteris paribus effect of an inde-
pendent variable of key interest, x1, on a dependent variable, y, as expressed
by the size of x1’s regression coefficient/​b1.
Motivation 2. Motivation 2 is related to the first, but requires a bit
more: Phenomena in real life typically change as a result of changes in many
other factors –​and not only as a consequence of changes in one factor. If
you happen to know that your real-​life phenomenon, y, is driven by five
main factors, a regression containing only one independent variable is a
poor representation of reality. A good representation, in contrast, reflects
what happens in real life to a great extent (within reason), suggesting five inde-
pendent variables in this case. (If you are to choose between a good represen-
tation and a poor one? Well…) Taking this reasoning across to the realm of
regression analysis we get: If the lion’s share of variation in y is brought about
by five factors, your regression should if possible include all these five inde-
pendent variables.
Sometimes these motivations blend well together, as in when x1 is one of
the five factors that theory or prior research suggest belong in the regression.
If so, you have more reason to believe that b1 expresses x1’s unique effect on
y. I will say more on the various motivations for, or purposes of, regression
analysis in Chapter 10.

3.4 A special case: Sets of independent dummy variables


Multiple regression in essence means that a regression as a minimum contains
two independent variables, x1 and x2, disregarding the extra mathematical
complexities. So far these independent variables have been either numerical
variables or dummies, i.e. variables with 0 or 1 as values. In this section we
take a look at situations in which the independent variables are of a categor-
ical nature with more than two values. To illustrate the general procedure,
however, we only need to use one such categorical variable.
Smoking status and hours of exercise. Is smoking associated with exercise
hours? In order to examine this, we use questionnaire data on 422 male and
female students (all_​stud_​sport.dta). The dependent variable is the
same as before, namely weekly hours of exercise (phys_​act_​h; mean is 3.58
hours). The categorical variable denoting smoking status is called smoke.
The frequency table of smoking status is displayed in Stata-​output 3.4. (The
command generating the output is again “tab”.)
The new feature in Stata-​output 3.4 is that the independent variable
has three categorical values or answer categories. We might add that these
categories are ordered, but this is unimportant. Note that most students do
not smoke (69%), whereas 16% smoke sometimes and 15% smoke on a daily
basis. (The data are from some years back; fewer students smoke now.) In the
data never is assigned the value of 0, whereas sometimes is assigned 1 and daily
is assigned 2.

Stata-​output 3.4 Frequency table of smoking status (smoke).

. use "C:\Users\700245\Desktop\temp_myd\all_stud_sport.dta"

. tab smoke

smoke | Freq. Percent Cum.


------------+-----------------------------------
never | 292 69.19 69.19
sometimes | 68 16.11 85.31
daily | 62 14.69 100.00
------------+-----------------------------------
Total | 422 100.00

How may we examine if and how this smoking status variable is associated
with exercise hours? We make two new dummy variables from the three levels
of the smoking status variable, and then use these two dummies in a multiple
regression. (This is the general way to proceed; a faster way within Stata will
be described shortly.)
The upper part of Stata-​output 3.5 first creates the dummy some_​smoke,
indicating whether a student smokes sometimes (1) or not (0). The two
commands are “gen some_​ smoke = smoke” and “recode some_​
smoke 1=1 else=0”. Next, the dummy daily_​smoke is created in the
same manner, save for the small difference in the recoding procedure. Note
that we did not (have to) make a dummy for the never category. At the bottom
of the output, we have the multiple regression results. The constant (_​cons)
refers to the dummy we did not make, i.e. the mean of exercise hours for those
never smoking (i.e. about 4 hours per week).8
Interpretation: Those smoking sometimes (some_​smoke = 1) exercise for
1.33 less hours per week than those who never smoke (the constant); those
smoking daily (daily_​smoke = 1) exercise for 1.62 less hours per week
than those who never smoke (the constant). The difference in exercise hours
between those smoking daily and those smoking sometimes is therefore about
0.29 hours (1.62–​1.33 = 0.29). Smoking appears to be associated with exercise
hours, although the difference between two groups of smokers is of a small
magnitude.

Stata-​output 3.5 Regression of hours of exercise per week (phys_​act_​h) on smoking
sometimes (some_​smoke) and smoking daily (daily_​smoke) and
commands for creating these two latter dummy variables.

. gen some_smoke = smoke

. recode some_smoke 1=1 else=0


(some_smoke: 62 changes made)

. gen daily_smoke = smoke

. recode daily_smoke 2=1 else=0


(daily_smoke: 130 changes made)

. regress phys_act_h some_smoke daily_smoke

Source | SS df MS Number of obs = 422


-------------+---------------------------------- F(2, 419) = 7.34
Model | 196.86818 2 98.43409 Prob > F = 0.0007
Residual | 5620.38774 419 13.4138132 R-squared = 0.0338
-------------+---------------------------------- Adj R-squared = 0.0292
Total | 5817.25592 421 13.81771 Root MSE = 3.6625

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------

some_smoke | -1.330681 .4931528 -2.70 0.007 -2.300043 -.3613191


daily_smoke | -1.620885 .5121421 -3.16 0.002 -2.627573 -.6141969
_cons | 4.04024 .2143309 18.85 0.000 3.618942 4.461537
------------------------------------------------------------------------------

The procedure in Stata-output 3.5 is the general approach for examining
if and how an independent variable with > 2 categorical values might be
associated with a dependent variable. The key is to make one less dummy
than there are categorical values for the independent variable in the data. In
this case there were three values (i.e. three smoking statuses); hence the need
for two dummies. We could have made dummies for never smoke and some-
times smoke instead. If so, the daily smoker category would have been the
reference group (the constant), and the two regression coefficients would have
had positive signs.
R-​output 3.2 shows the above procedure in R. Note again that R calls the
constant (_​cons in Stata) the Intercept; other than this the regression
coefficients are similar (–​1.33 and –​1.62). Note also that the code for creating
dummies in R (line 10 and 11) is a bit different from Stata’s.

R-​output 3.2 R command-​file generating the statistical output of Stata-​output 3.5.
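As with R-output 3.1, the command-file is a screenshot in the printed book. A minimal R sketch producing the same results might look as follows; the haven package and the file path are illustrative assumptions:

library(haven)                              # for reading Stata .dta files into R
stud <- read_dta("all_stud_sport.dta")      # illustrative path to the student data

# Two dummies from the three-category smoke variable (0 = never, 1 = sometimes,
# 2 = daily), leaving never smoke as the reference category
stud$some_smoke  <- ifelse(stud$smoke == 1, 1, 0)
stud$daily_smoke <- ifelse(stud$smoke == 2, 1, 0)

summary(lm(phys_act_h ~ some_smoke + daily_smoke, data = stud))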

Stata has a faster and less error-​prone way of doing the regression in Stata-​
output 3.5. This is shown in Stata-​output 3.6, where the command prefix “i.”
by default tells Stata to use the lowest value of the independent variable of
interest (here: never smoke; coded 0) as the reference in a multiple regression.
(This may be altered; see the help menu.)

Stata-output 3.6 Regression of hours of exercise per week (phys_act_h) on smoking status (smoke) using the “i.” prefix to create and use dummies.

. regress phys_act_h i.smoke

Source | SS df MS Number of obs = 422


-------------+---------------------------------- F(2, 419) = 7.34
Model | 196.86818 2 98.43409 Prob > F = 0.0007
Residual | 5620.38774 419 13.4138132 R-squared = 0.0338
-------------+---------------------------------- Adj R-squared = 0.0292
Total | 5817.25592 421 13.81771 Root MSE = 3.6625

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoke |
sometimes | -1.330681 .4931528 -2.70 0.007 -2.300043 -.3613191
daily | -1.620885 .5121421 -3.16 0.002 -2.627573 -.6141969
|
_cons | 4.04024 .2143309 18.85 0.000 3.618942 4.461537
------------------------------------------------------------------------------
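R has a similar shortcut: wrapping the variable in factor() makes lm() create the dummies automatically, using the lowest value (here 0, i.e. never smoke) as the reference group. A minimal, hedged sketch along the lines of the one above:

library(haven)
stud <- read_dta("all_stud_sport.dta")    # illustrative path, as before
summary(lm(phys_act_h ~ factor(smoke), data = stud))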

Many real-​life multiple regressions include both sets of dummies (i.e. cat-
egorical variables) and numerical variables at the same time. This does not
complicate matters; the only thing to remember is the ceteris paribus or all-​
else-​equal interpretation of the coefficients in the regression.

3.5 A short summary and the regression model’s explanatory power (R2)
It is time to take stock of what we have learned and to introduce a new topic.
Regression is about finding out how variation (or change) in a set of inde-
pendent variables, labelled x1 to xn, is statistically associated with the variation
(or change) in a dependent variable/​y. That’s all there is. Regression does not
tell why y changes when x1 or x2 and so on change values (that’s what theory is
for!), but it tells you if it does and by how much. This “if and how much”-​figure
is captured by the size of the regression coefficient. The y of interest is typically
numerical, i.e. some trait measured on a range from small to large numbers,9
whereas the independent variables may be numerical or categorical in nature.
The next example brings together much of what we have learned about
multiple regression. We continue with the data analyzed in section 3.4 and
introduce some new independent variables. The dependent variable, y, is still
students’ weekly hours of exercise. The first independent variable is the sex
of the student (gender), where 1 denotes female and 0 denotes male. The
second independent variable is membership in a fitness center (fitness_​ c),

where membership is coded 1 and non-​membership is coded 0. The third inde-
pendent variable is the familiar smoking variable (smoke) from section 3.4.
The fourth independent variable, an ordinal variable, reads “The monetary
costs involved in exercising put restrictions on my participation” (applies very
poorly = 0, applies poorly = 1, neither/​nor = 2, applies well = 3 and applies
very well = 4). The mean of this 0–​4 scale, labelled cost in the data, is 1.53,
where larger values imply more severe subjective cost restrictions. Finally, the
age of the student (age) is the last independent variable (mean = 22.34 years).
The multiple regression appears in Stata-​output 3.7. (Since there is nothing
new in terms of coding in R compared with the previous R-​examples, the
similar R-​commands and output are dropped.)

Stata-output 3.7 Regression of hours of exercise per week (phys_act_h) on gender, fitness center membership (fitness_c), smoking status (smoke), monetary cost restrictions (cost) and age.

. regress phys_act_h gender fitness_c i.smoke cost age

Source | SS df MS Number of obs = 422


-------------+---------------------------------- F(6, 415) = 12.30
Model | 878.496121 6 146.41602 Prob > F = 0.0000
Residual | 4938.7598 415 11.900626 R-squared = 0.1510
-------------+---------------------------------- Adj R-squared = 0.1387
Total | 5817.25592 421 13.81771 Root MSE = 3.4497

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -1.422866 .3653636 -3.89 0.000 -2.14106 -.7046719
fitness_c | 1.555477 .3485678 4.46 0.000 .8702987 2.240656
|
smoke |
sometimes | -1.292541 .4659708 -2.77 0.006 -2.208498 -.3765833
daily | -.9184787 .4975576 -1.85 0.066 -1.896526 .0595687
|
cost | -.4181837 .1189421 -3.52 0.000 -.6519878 -.1843795
age | -.034727 .0387085 -0.90 0.370 -.1108163 .0413622
_cons | 5.670714 .9582337 5.92 0.000 3.787117 7.55431
------------------------------------------------------------------------------

Adjusting for controls, female students exercise for 1.42 less hours than
male students on a weekly basis. That is, female students “similar” to male
ones with respect to fitness center membership, smoking status, subjective
cost restrictions and age exercise for almost 1.5 fewer hours per week than
their male counterparts. The constant of 5.67 has no real-​life meaning, since
none of the students have the value of zero on all independent variables in
the regression. Analogously (i.e. all-​else-​equal), fitness center members exer-
cise for 1.56 more hours than non-​members. Both groups of smokers exer-
cise fewer hours than non-​smokers; the ceteris paribus differences being 1.29
hours between non-​smokers and sometimes-​smokers and 0.92 hours between
non-​smokers and daily-​smokers. The cost variable has a regression coefficient
of about –​0.42. Interpreted dynamically it suggests that a one-​unit increase
in this variable, say from applies well (3) to applies very well (4), is associated

with a 0.42-hour decrease in exercise hours. More statically and correctly, we
have that a student answering “applies very well” (4) on average exercises for
1.68 fewer hours per week than a student answering “applies very poorly”
(0) (4 × –​0.42 = –​1.68). That is, those facing financial constraints on exer-
cising exercise less than those who do not. Age has a tiny negative effect
(–​0.03), suggesting that older students exercise slightly less per week than
younger ones. But since this coefficient is close to zero (implying no effect
of age on hours of exercise), we pay no attention to it and conclude that age
does not appear to be associated with hours of exercise. We will later learn,
in Chapter 4, why this is justified.
A regression model’s explanatory power –​ the R2. So far we have been
interested in the individual coefficients (effects) of each independent vari-
able in the regression. We now turn attention to the regression model per se.
More specifically, we would like to know how “good” our regression model
is in terms of explanatory power. In regression-​speak this boils down to
how much of the variation (strictly speaking, variance) in y the x-​variables
combined account for. But first, what do we mean when we say regression
model? Simply put, the regression model in Stata-​output 3.7 may be written
in algebraic form as

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + e,

where y = hours of exercise per week, b0 is the constant, the different bs are
the various regression coefficients (effects) to be estimated, and
x1 = gender,
x2 = fitness center membership,
x3 = smoking sometimes,
x4 = smoking daily,
x5 = monetary cost, and
x6 = age.
Visually this regression may be depicted as shown in Figure 3.3. Whether
expressed in algebraic or visual form, our regression model suggests that (the
amount of) exercise hours is an additive function of six independent variables.
The “goodness” of this model is expressed in terms of how much of the vari-
ation in hours of exercise the six independent variables explain altogether.
The so-​called R2 is the measure of interest in this regard.
To exemplify how the R2 is calculated, take a look at the SS column in
Stata-​output 3.7. The term SS is short for sum of squares, which in simplified
terms we may think of as a measure of the variation in the data. In the third
row of this column (Total) we have the total sum of squares (TSS); that is
5 817.26. TSS tells us how much variation (variance) there is in y altogether in
the data. In the first row (Model) we have the model sum of squares (MSS);
that is 878.50. MSS denotes how much of the variation in y is attributable to
the independent variable in the model. If we divide MSS by TSS we get the R2.


Figure 3.3 Visual model of the regression estimated in Stata-output 3.7. Correlations
among the independent variables as well as the error term, e, are omitted for the sake
of brevity.

In our model the R2 is 0.151 or 15.1% (878.50/​5 817.26 = 0.151). The six
independent variables in the regression model explain about 15% of the vari-
ation (variance) in the dependent variable. In other words, 85% of the vari-
ation in hours of exercise stems from independent variables not included in
the regression model and/​or randomness. We find the R2 on the right-​hand
side of Stata-​output 3.7, denoted R-​squared. Every statistics program
doing regression reports this measure. In the program R, the R2 is reported at
the bottom of the output as Multiple R-​squared (see e.g. R-​output 3.2).10
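To make the MSS/TSS logic concrete, here is a small R sketch that computes R2 “by hand” for a model like the one in Stata-output 3.7 and compares it with the figure reported by the software (the haven package and the file path are illustrative assumptions):

library(haven)
stud <- read_dta("all_stud_sport.dta")    # illustrative path to the student data
fit  <- lm(phys_act_h ~ gender + fitness_c + factor(smoke) + cost + age,
           data = stud)

y   <- model.response(model.frame(fit))   # the dependent variable as used in the fit
tss <- sum((y - mean(y))^2)               # total sum of squares (TSS)
mss <- sum((fitted(fit) - mean(y))^2)     # model sum of squares (MSS)
mss / tss                                 # R-squared computed by hand
summary(fit)$r.squared                    # the same figure reported by R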
Students understandably want to know if there is a threshold value for R2
suggestive of a “good” model. Unfortunately, there is no such thing. Whether
our model of exercise hours is acceptable or not must be judged against a yard-
stick. If some researchers have earlier presented a similar model of exercise
hours explaining 25% of the variation, we might have a problem in the sense
that our model may lack important (control) variables. If these researchers’
R2 was only 10%, in contrast, our model looks better on the face of it.11 That
said, few would argue against the idea that a model explaining much of the
variation in y is preferable to a model explaining very little. Finally, it should
be noted that analyses of questionnaire data very seldom yield an R2 larger
than, say, 50%, whereas R2 scores between 10 and 20% are very common
indeed.12
A new case: The price of cognac. Below we examine how the price of cognac is
associated with certain key attributes of this liquor. To address this issue, we
use a data set on 52 bottles of cognac (cognac_​price.dta). Descriptive
statistics for the variables in the data appear in Stata-​output 3.8.

Stata-​output 3.8 Descriptive statistics for the variables price, quality and type.

. use "C:\Users\700245\Desktop\temp_myd\cognac_price.dta", clear

. sum price quality

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
price | 52 399.8077 82.72137 230 620
quality | 52 82.51923 1.873346 77 87

. tab type

type | Freq. Percent Cum.


------------+-----------------------------------
VS | 18 34.62 34.62
VSOP | 13 25.00 59.62
XO | 21 40.38 100.00
------------+-----------------------------------
Total | 52 100.00

We note that a bottle of cognac costs NOK 400 on average. Furthermore,
on the quality scale from 77 to 87 points (larger scores = better quality), the
average is 82.5. Finally, there are three types (or categories) of cognac in the
data: VS, VSOP and XO. Our first research question is to examine how quality
is associated with price. Stata-​output 3.9 answers this question.

Stata-​output 3.9 Regression of price of cognac (price) on quality of cognac (quality).

. reg price quality

Source | SS df MS Number of obs = 52


-------------+---------------------------------- F(1, 50) = 39.12
Model | 153188.021 1 153188.021 Prob > F = 0.0000
Residual | 195796.056 50 3915.92111 R-squared = 0.4390
-------------+---------------------------------- Adj R-squared = 0.4277
Total | 348984.077 51 6842.82504 Root MSE = 62.577

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
quality | 29.25561 4.6775 6.25 0.000 19.86058 38.65065
_cons | -2014.343 386.0812 -5.22 0.000 -2789.81 -1238.876
------------------------------------------------------------------------------

Quality unsurprisingly has a positive effect on price; a one-point increase
in quality is suggestive of a NOK 29 more expensive cognac, dynamically
speaking. More correctly, a cognac getting a quality score of 80 costs NOK
29 more on average than a cognac obtaining a score of 79. Quality accounts
for 44% of the variation (variance) in price; the R2 is 0.439. Again we need a
benchmark to evaluate if this R2 score is high or low or somewhere midway
between.

Stata-​output 3.10 scrutinizes the relationship between cognac type and
price. Note that VSOP costs about NOK 55 more than VS (the reference) on
average, and that XO similarly costs about NOK 139 more than VS. The price
of VS is approximately NOK 330, as indicated by the constant. (Take a look
at section 3.4 again if these interpretations are unfamiliar.) The variable type
accounts for 55% of the variation in price.

Stata-​output 3.10 Regression of price of cognac (price) on type of cognac (type).

. reg price i.type

Source | SS df MS Number of obs = 52


-------------+---------------------------------- F(2, 49) = 29.95
Model | 191950.651 2 95975.3254 Prob > F = 0.0000
Residual | 157033.426 49 3204.7638 R-squared = 0.5500
-------------+---------------------------------- Adj R-squared = 0.5317
Total | 348984.077 51 6842.82504 Root MSE = 56.611

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
type |
VSOP | 54.7265 20.60492 2.66 0.011 13.31936 96.13363
XO | 139.254 18.18379 7.66 0.000 102.7123 175.7957
|
_cons | 329.8889 13.34325 24.72 0.000 303.0746 356.7032
------------------------------------------------------------------------------

In Stata-output 3.11, both cognac quality and type are independent
variables in the regression. Two features are to be noted. First, the effect of
quality gets reduced in magnitude by more than half when controlled for type
of cognac (from 29 to 13.5). Similarly, the two effects of cognac type are some-
what reduced when controlled for quality. Second, whereas the two bivariate
models (Stata-​outputs 3.9 and 3.10) yield R2 scores of, respectively, 44% and
55%, the multiple model accounts for 60% (Stata-​output 3.11). Why does the
latter model not account for 99% of the variation in price (44 + 55 = 99)?
The answer is that quality and type are correlated, which also explains why
their individual effects are weaker in the multiple regression than in the
bivariate ones. The Venn diagram in Figure 3.4 illustrates. Figuratively we
might say the R2 of the regression with quality and type (60%) is obtained by
adding 44 and 55 and “subtracting the correlation” between these two inde-
pendent variables, as indicated by the overlap between the two lower circles
representing them. The same reasoning extends to cases with more than two
independent variables.
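A small simulation in R illustrates the point: two correlated independent variables may each explain a sizable share of y on their own, yet their combined R2 falls well short of the sum of the two bivariate R2 values. (All numbers below are made up purely for illustration.)

set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)     # x2 is correlated with x1
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n, sd = 2)

summary(lm(y ~ x1))$r.squared           # bivariate R-squared for x1
summary(lm(y ~ x2))$r.squared           # bivariate R-squared for x2
summary(lm(y ~ x1 + x2))$r.squared      # combined R-squared: less than the sum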


Stata-output 3.11 Regression of price of cognac (price) on quality of cognac (quality) and type of cognac (type).

. reg price quality i.type

Source | SS df MS Number of obs = 52


-------------+---------------------------------- F(3, 48) = 24.28
Model | 210373.694 3 70124.5647 Prob > F = 0.0000
Residual | 138610.383 48 2887.71631 R-squared = 0.6028
-------------+---------------------------------- Adj R-squared = 0.5780
Total | 348984.077 51 6842.82504 Root MSE = 53.737

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
quality | 13.54774 5.363688 2.53 0.015 2.763325 24.33216
|
type |
VSOP | 42.39458 20.15931 2.10 0.041 1.861563 82.92759
XO | 101.5138 22.82966 4.45 0.000 55.61172 147.4159
|
_cons | -769.7362 435.5369 -1.77 0.084 -1645.442 105.9694
------------------------------------------------------------------------------

Figure 3.4 The overlap between the two lower circles (in black shading) denotes the
correlation between cognac quality (x1) and cognac type (x2). The non-​
overlapping parts of these circles (in gray shading) express the regression
coefficients/​bs (effects) that are unique to cognac quality (x1) and cognac
type (x2) in explaining variation in price (y); see also section 3.2.


3.6 Summary and further reading


In many applications in the social sciences we want to know as exactly as pos-
sible to what extent an independent variable, x1, is associated with a dependent
variable, y. To that end we use multiple regression. The major forte of mul-
tiple regression is the ability to adjust for (or control for or hold constant) the
effects of other independent variables (i.e. relevant controls) in such a way
that we are able to gauge the unique (or ceteris paribus) effect of the inde-
pendent variable of key interest, x1. A prerequisite for this endeavor’s likeli-
hood to succeed is that we have access to all relevant control variables in our
data –​which in most applied settings is a tall order indeed.
Sometimes we are interested in to what extent all the independent variables
combined are able to explain variation (variance) in the dependent variable.
R2 measures this capability and its theoretical values range from 0 to 1 –​or
from 0 to 100%. Without some yardstick, often found in prior research on
the topic in question, it is difficult to assess whether a given R2 value is non-​
satisfactory, adequate or very satisfactory.
The literature mentioned in section 2.8 is of course equally relevant for
this chapter; see in particular Allison (1999; Chapters 1, 2 and 4), Berk (2004;
Chapters 5, 6 and 7) and Wooldridge (2000; Chapter 3). In addition, I strongly
recommend Pearl and Mackenzie (2018) to get a deeper understanding of
confounding and the need to control for “lurking variables” in statistical analysis.

Notes
1 I also discuss experimental data in Chapters 9 and 10.
2 More on this problem, i.e. the problem of obtaining so-​called causal effects, in
Chapters 9 and 10 and sections 11.4 and 11.6.
3 The principle simultaneously also works the other way around: Fitness center
membership is controlled for exercise interest, and exercise interest is controlled
for fitness center membership.
4 In some cases the regression coefficient of interest might increase in size when
controls are added to the model. This is called a suppressor effect; see DeMaris
(2004) for more on this.
5 That is, “relevant” means (control) variables that affect y and are correlated with x1. More on this in section 5.2, i.e. the regression assumption that all relevant
independent variables should be included in the regression model (i.e. no omitted
independent variables).
6 Twenty observations per independent variable in the model should be adequate in
most cases (see Allison, 1999).
7 The group comparison approach is chosen for ease of exposition. The inde-
pendent variables of main interest do not have to be dummies; they could instead
be numerical or ordinal independent variables.
8 Remember that the constant is the point where the regression line crosses the y-axis, i.e. at x = 0. If a student has the value of x = 0 for some_smoke (i.e. does not smoke sometimes) and the value of x = 0 for daily_smoke (i.e. does not smoke
daily), he or she must be a non-​smoker (i.e. answer never) since there are no other
smoking statuses available in the questionnaire/​data.
9 In Chapters 7 and 8 I consider regression analysis for non-​numerical dependent
variables.
10 More strictly, the sum of squares in question refers to the size of the error term (e) of the regression, which becomes smaller when independent variables are added to the model (cf. section 2.7). The so-called adjusted R2 takes into account that a regression with many independent variables is, by random chance, more likely to produce a larger R2 than a regression with fewer independent variables. The adjusted R2 takes away this advantage.
11 If the research question is confined to finding x1’s unique effect on y, which often
is the case, the role and size of R2 is typically of minor importance in practical
research.
12 The type of data analyzed also matters for the size of R2. Cross-​sectional data typ-
ically yield lower R2 values than time-​series data.

Exercises with data and solutions


Note! The regression models in the exercise sections of this book are stated throughout in “Stata-speak”, in keeping with the book’s informal and practical approach. It deserves mentioning that this way of expressing regression models is, strictly speaking, mathematically incorrect.

Exercise 3A
Use the data housedata.dta. A dummy in this data set is called “sec_living”
and denotes whether or not a house has a second living room (yes = 1, no = 0).
3A1: Estimate the following regression model:

price_​Euro = sec_​living

What does this bivariate regression tell you?


3A2: Estimate the following regression model:

price_Euro = sec_living + renovate

What does this multiple regression tell you?


3A3: Estimate the following regression model:

price_Euro = sec_living + renovate + square_m

What does this multiple regression tell you?
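A sketch of the Stata commands for 3A1–3A3 (assuming housedata.dta is in your working directory; the variable names are those given above):

use housedata.dta, clear
reg price_Euro sec_living                     // 3A1: bivariate model
reg price_Euro sec_living renovate            // 3A2: renovate added as control
reg price_Euro sec_living renovate square_m   // 3A3: house size added as control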

Exercise 3A, solution
3A1:
The regression coefficient for sec_​living is 107 070, suggesting that
houses with a second living room sell for 107 000 € more on average than
houses without such a room. Houses without a second living room sell for, on
average, 421 254 € (i.e. the constant). The R2 is about 0.09, indicating that the
dummy in question accounts for 9% of the variation (variance) in sale prices.
3A2:
The regression coefficient for sec_​living is 104 267, suggesting that houses
with a second living room sell for 104 000 € more than houses without such a
room, controlling for renovate. Similarly, renovated houses sell for 74 756 €
more than non-renovated ones, controlling for sec_living. Since both independent variables are dummies (i.e. have the same measurement scale), the results suggest that a second living room is slightly more important for price variation
than renovation. (The fact that the R2 does not double in size from the bivariate
model is also suggestive of this.) The mean price of a non-​renovated house with
no second living room is 386 679 € (i.e. the constant).
3A3:
The regression coefficient for sec_living is now 18 225, i.e. minuscule compared with the models in A1/A2. This suggests a more or less spurious relationship between sec_living and price, caused by the variable square_m as explained in section 3.1. The same logic applies to the variable renovate,
but to a smaller extent; the b is now 38 900. The main driver behind variation
in house prices is, as expected, house size (i.e. square_​m). The increase in the
R2 to 0.54 (or 54%) also underscores the latter point.

Exercise 3B
Use the data football_earnings.dta. The variable games_int refers to the total number of international matches played, and the variable age is the
age of the player in question. Both variables are expected to affect yearly
earnings in € (earnings_​E).
3B1: Estimate the following regression model:

earnings_​E = games_​int

What does this bivariate regression tell you?


3B2: Estimate the following regression model:

earnings_E = games_int + age

What does this multiple regression tell you?
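A sketch of the Stata commands for 3B1 and 3B2 (assuming football_earnings.dta is in your working directory):

use football_earnings.dta, clear
reg earnings_E games_int        // 3B1: bivariate model
reg earnings_E games_int age    // 3B2: player age added as control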

Exercise 3B, solution
3B1:
The regression coefficient for games_​int is 3 044, indicating that a player
with 20 international matches on average earns about 30 000 € more per year
than a player with 10 international matches (10 × 3 044). The R2 of 0.14
suggests that number of international matches explains 14% of the variation
(variance) in yearly earnings.
3B2:
The regression coefficient for games_​int is 2 557, controlling for player
age (age), indicating that a player with 20 international matches earns about
26 000 € more per year than a player with 10 international matches (10 ×
2 557). 16% of the association between number of international matches and
earnings is therefore spurious and caused by player age ((3 044–​2 557)/​ 3 044
≈ 0.16). The regression coefficient for age is 3 794, implying that a 30-​year-​old
player earns about 19 000 € more per year than a 25-​year-​old ceteris paribus
(5 × 3 794). But does it make sense that earnings keep on increasing indefinitely as the players age? More on this in sections 6.3, 6.4 and 6.5. Since the R2 is only 0.18 compared with 0.14 in B1, the regression model suggests that games_int is more important for the variation in earnings than age.

Informal glossary to Part 1


All-​else-​equal (or ceteris paribus):
Term denoting that the size of the regression coefficient of main interest, b1,
is controlled or adjusted for the control variables in the regression model.
Note that the “all” clause refers only to the control variables included in the
regression model.
b0:
See constant
b1:
See regression coefficient
Constant:
The point where the regression line crosses the y-axis, i.e. at x = 0. Note that
the constant only has a real-​life interpretation when 0 is an actual value in
the data for x1 (or xn). Note also that the constant must be in the regression
equation if one wants to make predictions for y based on specific values for
the x-​variables. The constant is also called b0 or the intercept.
Ceteris paribus:
See all-​else-​equal.

Confounder:
An independent variable, z, that affects both x1 and y in such a way that the x1-y relationship is (more or less) false/spurious.
Data:
Collective term for the units/​observations (see below) for which we have vari-
able information.
Dependent variable (or y):
The variable we want to explain statistical variation in, based on (variation in)
the independent variable(s).
Dummy variable (dummy):
A variable with only two distinct values, coded 0 and 1.
Error term (e):
An unobserved variable (along with randomness) summing up all effects
on the dependent variable other than those stemming from the inde-
pendent variables in the regression model. Generally speaking, we assume that e has a mean of 0.
Independent variable (or x):
The variable we think is (partly) responsible for the variation in the dependent
variable.
Index:
A combination of several variables into one composite variable.
Intercept:
See constant.
Multiple regression (analysis):
A regression analysis containing two or more independent variables.
Observations:
See units.
Regression (analysis):
A statistical method associating an independent variable, x, to a dependent
variable, y. The aim is to quantify if and to what extent x is associated with y.
Note that a regression might include several independent variables simultan-
eously; see Multiple regression above.
Regression coefficient:
A figure describing to what extent x is associated with y in the data, i.e. the
change in y associated with a one-​unit increase in x. Also called b1 or slope.
Residual:
The vertical distance between a data point and the regression line in the data.

Respondents:
Persons who have answered a questionnaire, and as such often the units (see
below) up for analysis in a regression.
R2:
A figure denoting how much of the variation (variance) in the dependent variable is accounted for by all the independent variables in the regression combined.
Slope:
See regression coefficient.
Subjects:
Persons as in units (see below) up for analysis in a regression, but not neces-
sarily respondents answering a questionnaire.
Units:
The entities we have variable information (or data) on; the input for calcu-
lating b0 and b1 (see above). Often persons or firms in the social sciences. Also
called observations in many cases.
Variable:
Something that varies among units/​observations (or with time).
x:
See independent variable.
y:
See dependent variable.


Part 2

The foundations

In Part 2, “The foundations”, you will get to know:

• The difference between sample values (estimates) and population values


(parameters)
• How to find out if a regression coefficient in a sample is “large enough” to
warrant attention in a larger (but unobserved) population
• How to do and interpret tests of significance for regression analysis
• How to avoid judging significance values “mechanistically”
• The assumptions of regression analysis, and what they actually mean in
practice
• How to evaluate if your regression model meets the assumptions of
regression analysis


4 
Samples and populations, random
variability and testing of statistical
significance

4.1 Introduction
You have probably heard of the terms “sample” and “population” before.
You might also have a good grasp of what these concepts mean. If not, don’t
worry, as they will be explained below. Up until now I have used the word
“data” exclusively to emphasize that our regression coefficients, or bs, refer to
the data at hand. That is, no reference has been made to what these coefficients
might be “outside” of the data. This chapter is about making such inferences.
There are, to simplify a bit, two kinds of samples: representative samples
and non-​representative samples. The former will concern us in this chapter
(but see section 4.6). Representative samples are, as the name implies, intended
to be representative of some larger population. The traits of interest in this
population are most often unknown to us, and we thus use a representative
sample to tell us about these traits. The aim of this introduction is to obtain a
rudimentary first grasp of this process.
You have likely answered a political poll or survey on consumer habits or
some related issues. Often 1 000 people (respondents) or more answer such
surveys, and they are referred to as “the sample”. Yet when researchers pre-
sent results of such surveys, they make claims regarding “a population” pos-
sibly containing millions of people. In other words, they find out stuff about
a small sample but draw conclusions about a large population. How come?
The answer is that before the researchers collected their sample, they took
precautions to ensure that it would end up as representative of the larger
population. If they succeeded, which they most often do, they are entitled to
knowledge of a whole population even though they only have surveyed a small
fraction of this population. This is ingenious in terms of effort and costs!
In a regression we find the b1 for the relationship between x and y in the
data, which we now call a sample.1 (We omit control variables for simplicity.) If
this sample is representative of a population, and if the b1 of interest is clearly different from zero (in either direction), we may in general extrapolate this sample-specific b1 to
the population and claim that it is large enough to warrant attention from a
scholarly or practical point of view. That is, we claim x1 to be associated with
y in the population although we only have examined data from a small sample
of this population. Why we are able to do so is what Chapter 4 explains in a
rough and step-​by-​step fashion.


4.2 Populations, samples and representative samples: The magic of random sampling
A population is all the units of something. All soccer players in the Premier League (PL) therefore constitute a population, as do all those presently incarcerated in U.S. prisons or all adults in England. A sample, in contrast, is some fraction
of a population, for example the players of Manchester City, the prisoners
in Leavenworth and the adults of London. But these three samples are not
necessarily representative of their populations. How come? Using the players
of Manchester City (MC) as an example, these could be younger or older, taller or shorter, or better- or worse-performing than the rest of the
players in the PL. (The same reasoning applies for the other two examples.)
This means that if we were to use the information on the MC players to make
inferences about the PL, we may commit small or large errors depending on
the extent to which the MC players differ from the rest of the PL players
for the variables mentioned. That is, when a sample is not representative we
cannot make inferences (or generalizations) to the population of interest.
How do we obtain a representative sample? The answer lies in the concept of
random sampling.
Random sampling. The idea of random sampling is simple. Think of a lottery
where each number is equally likely to be drawn from a bucket. Or imagine
writing the name of each player in the PL on a ticket, say 400 tickets (i.e. the
population). Put all the tickets in a box, shake it and draw 50 tickets. (Also,
between every draw, shake the box.) There you go –​you have a random
sample of 50 players from the PL. The idea is that all members of the popu-
lation have the same, and known, probability of being sampled. Well, at
least almost. Each player in this case has a 1/400 chance of being drawn as the first player; in the second draw the chance for each remaining player changes slightly (to 1/399), and so on for the third draw etc. In practice, though, these selection probabilities
are equal enough. The “magic” lies in that the randomness of the procedure
ensures that the resulting sample becomes a representative sample.2
For simplicity, let’s start with the mean of a variable and not a regres-
sion coefficient. Say the mean age in the population of players in the PL
is 24 years. This figure is a parameter we normally do not know the value
of. Let’s also assume that in our random sample of 50 players from this
population, the corresponding sample mean is 23 years. This figure is
called an estimate (or, strictly speaking, a statistic providing an estimate
of the unknown parameter). Why are these means not exactly similar? The
answer is random variability. A thought experiment clarifies this. Imagine
putting the 50 tickets back in the box, shaking it and making 50 new draws.
Now let’s say the mean of age in this second sample is 25 years; a slightly
different mean than for the first sample (and for the population), but not
by much. Back with the 50 tickets, shake and repeat the process 1 000 times
or so. For each new sample, compute the mean of age and write down

the result: 22.8, 25.3, 23.5, 24.6, 25.0… and so on. In other words, there are slightly different mean values in each sample due to chance variability. Now, thanks to statisticians, a fascinating feature is that we know
a priori (i.e. before doing any statistical analysis of our own) what this
distribution of mean values looks like and what characterizes it. That is,
we know that this sequence of sample means, if we actually were to carry
out this repeated sampling thought experiment, would be roughly normally
distributed.3
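The thought experiment can also be mimicked on a computer. The following Stata sketch is purely illustrative (all numbers are invented): it creates a hypothetical population of 400 players with a mean age of about 24 years and then draws five random samples of 50 players, printing the mean of age in each sample.

* Illustrative simulation of the repeated-sampling thought experiment.
clear
set seed 12345
set obs 400                             // the "population" of 400 players
generate age = round(rnormal(24, 3))    // ages centered on 24 years
forvalues i = 1/5 {
    preserve
    sample 50, count                    // draw a random sample of 50 players
    quietly summarize age
    display "Sample `i': mean age = " %4.1f r(mean)
    restore
}

The five sample means differ slightly from each other and from the population mean – exactly the chance variability described above.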

4.3 The normal distribution and repeated sampling


The normal distribution. The normal distribution is a key concept in statistics.
The following example illustrates. We have a random sample of 325 apartments
(price_​sq_​m.dta) and two variables: sale price in euros (price_​e) and
square meter (sq_​m). The latter variable, apartment size, is what interests us.
The mean of this variable is 64.92 m2 (SD = 23.52) and its frequency distribu-
tion is depicted graphically in the histogram in Figure 4.1.
The imposed line in Figure 4.1 is a perfect normal distribution. The vari-
able square meter is therefore not perfectly normally distributed, but it is close
enough for our purposes. Note that the top point of the curve tallies with the
mean of the variable (64.92). Also, the apartments lie in the interval from
Figure 4.1 Histogram of variable square meter (sq_​m), with normal curve imposed.


[Figure: standardized normal curve with z-scores from –3.5 to 3.5 on the horizontal axis, with the 68% and 95% regions marked.]
Figure 4.2 Percentage of observations under the normal curve for a standardized variable.

approximately 20 to 170 m2 with a concentration in the 50 to 90 m2 region (i.e. the tallest bars). But the vital thing to note for a normally distributed variable
is that 95% of its observations fall in the interval mean ± approximately two
standard deviations. In other words, 95% of the sample apartments are in the
region of 64.92 m2 ± (2 × 23.52). For purposes we get back to in section 4.5,
however, we often relate the normal distribution to a standardized variable,
i.e. a variable with a mean of 0 and a standard deviation of 1. This standard
deviation is referred to as z. For a standardized variable, thus, 95% of the dis-
tribution is to be found in the region 0 ± 2 z. Figure 4.2 illustrates.
In practice, and as with the apartment data, we only have one sample.
But as in the soccer player thought experiment, we might imagine drawing
repeated samples of 325 apartments from the population of apartments. For
each sample we calculate the mean of m2 and write it down: 62 m2, 68 m2,
69 m2, 63 m2… and so on. What will the distribution of these (imagined)
means from a repeated sampling process look like? You guessed correctly: A
normal distribution.
Statisticians have also unearthed another key insight in this regard. Based
on information in our one sample of apartments, we may calculate an interval
for the mean of square meters in the unknown population of interest. This
interval is called the 95% confidence interval.

4.4 Confidence intervals for means and regression coefficients


The 95% confidence interval for a mean tells us that the mean of a variable will
lie in a specific region in the population with 95% certainty or confidence. The

formula to obtain this interval resembles the one we have seen so far: sample
mean ± 2 standard errors.4
What is a standard error (SE)? Recall the standard deviation measuring
the spread or variation for an observed variable, such as the age of PL
players or the size of apartments. The SE, in contrast, refers to an imagined
process of repeated sampling. That is, the SE measures the spread for the
(imagined) variable resulting from a thought experiment in which we draw
repeated samples from a population, calculate the mean of the variable of
interest for each new sample and find the “standard deviation” for this (long
and imagined) list of mean values. Luckily, however, in the case of the mean,
we may use information in our sample to estimate this SE. The formula
is simply

SE = s / √n,

where s is the standard deviation for the variable of interest, and n is the
number of observations in the sample. For the variable square meter (sq_​m)
in the apartment data, these figures are 23.52 and 325 (see section 4.2). The
standard error (SE) thus becomes

23.52 / √325 ≈ 1.305.

Now we have what we need to compute the 95% confidence interval for the
mean of square meter (sq_​m) in the population of interest, which is

64.92 ± 2 × 1.305 → 64.92 ± 2.61 → [62.31, 67.63].

The mean of square meter in the population of apartments lies in the region
between 62.31 and 67.63 m2 with 95% confidence.
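The same numbers can be reproduced in Stata, either with the display calculator or by letting Stata compute the interval directly. A sketch (assuming price_sq_m.dta is loaded; note that Stata uses 1.96 rather than 2, so its bounds differ marginally from ours):

display "SE    = " 23.52/sqrt(325)
display "lower = " 64.92 - 2*23.52/sqrt(325)
display "upper = " 64.92 + 2*23.52/sqrt(325)
ci means sq_m    // Stata's own 95% confidence interval (older Stata versions: ci sq_m)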
Main point: This example shows how to make a rather precise description of
a population variable (here: a mean) based on only the sample equivalent of
this description. But recall that this requires that the sample in question is rep-
resentative of the population which, in turn, assumes that the sample has been
drawn according to some kind of random procedure (see note 2).
The 95% confidence interval for a regression coefficient. By the same logic as in
getting from an exact sample mean to a more imprecise population mean, we
do the equivalent for a sample regression coefficient. The key is first to find the
standard error (SE) of this regression coefficient. We use the apartment data
to illustrate; Stata-​output 4.1 reports the regression between the variables
square meter (sq_​m) and sale price in euros (price_​e).

Stata-​output 4.1 Regression of apartment sale prices (price_​e) on apartment sizes
(sq_​m).

. use "C:\Users\700245\Desktop\temp_myd\price_sq_m.dta"

. reg price_e sq_m

Source | SS df MS Number of obs = 325
-------------+---------------------------------- F(1, 323) = 134.25
Model | 2.0751e+12 1 2.0751e+12 Prob > F = 0.0000
Residual | 4.9926e+12 323 1.5457e+10 R-squared = 0.2936
-------------+---------------------------------- Adj R-squared = 0.2914
Total | 7.0677e+12 324 2.1814e+10 Root MSE = 1.2e+05

------------------------------------------------------------------------------
price_e | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sq_m | 3402.354 293.6466 11.59 0.000 2824.652 3980.055
_cons | 91461.22 20273.53 4.51 0.000 51576.37 131346.1
------------------------------------------------------------------------------

The output contains the information we need to compute the 95% con-
fidence interval for the regression coefficient in the population, since the
formula is

Sample regression coefficient/​b ± 2 standard errors.

The sample regression coefficient/b is 3 402, saying that an 80 m2 apartment costs 3 402 € more on average than a 79 m2 apartment. The estimated SE (under the heading “Std. Err.”) is 293.65.5 With this we cal-
culate the 95% confidence interval to be

3 402 ± 2 × 293.65 → 3 402 ± 587.3 → [2 814.7, 3 989.3].

The 95% confidence interval for the square meter coefficient in the popula-
tion of apartments is roughly between 2 800 and 4 000 €.6 In the R program
the SE is reported on the right-​hand side of the regression coefficient; see e.g.
R-​output 2.1.
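After running the regression in Stata-output 4.1, the same interval can also be recovered from Stata's stored estimation results (a sketch, using the rule-of-thumb multiplier of 2):

quietly reg price_e sq_m
display "lower = " _b[sq_m] - 2*_se[sq_m]
display "upper = " _b[sq_m] + 2*_se[sq_m]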

4.5 Testing for the significance of a regression coefficient: Putting the confidence interval on its head
The confidence interval for the square meter regression coefficient does not
contain the value of zero. This means we can be quite confident that this coefficient is larger than 0 in the population of interest. But this is not how we
judge whether or not a regression coefficient is “large enough” to warrant
attention in an unknown population. Instead, we use tests of significance
to shed light on this issue. Before exemplifying this procedure, however, we
need to make a detour into the realm of hypothesis generation and hypothesis testing.



The null hypothesis (H0) and the alternative hypothesis (H1 or Ha). In stat-
istical research there are always two hypotheses of interest: the null (H0) and
the alternative (H1). The alternative hypothesis expresses what we believe a
priori (i.e. before doing the analysis). The null states the “opposite” of the
alternative hypothesis, meaning, a bit simplified, what we don’t believe in. To
make this concrete, consider the regression between square meter and sale
price above. Here, our alternative hypothesis would be that (a) the size of
apartments should affect the sale prices of apartments, or that (b) the size of
apartments should affect the sale prices of apartments positively.7 Whether we choose (a) or (b) as H1, H0 says that square meter should be statistically
unrelated to, or have no correlation with, sale price. That is, the null suggests
that the regression coefficient for square meter is zero. In short: H0 → b = 0.
We always test the null hypothesis in statistical research, for reasons we do
not go into.8 We test H0 against the data, and if what the results show appears unlikely (to be defined below) compared to what H0 suggests, we reject it. That
is, if H0 is rejected, we get indirect support for H1 –​what we believe in. In con-
trast, if we cannot reject H0, we keep it and conclude that the results do not
support what we believe in, i.e. H1.
The region of acceptance for H0. To examine whether to reject H0 or not, we
construct a 95% interval for what it suggests. Turning the logic of the 95%
confidence interval on its head, this interval is given by the formula
0 ± 2 standard errors,
since H0 states that b = 0.
Based on the information in Stata-​output 4.1 this yields

0 ± 2 × 293.65 → 0 ± 587.3 → [–​587.3, 587.3].

The region of acceptance for H0 is from about –​600 to 600. Since our sample
regression coefficient is 3 402 –​i.e. clearly on the outside of this region –​we
reject H0 and gain indirect support for the H1 that square meter has a statis-
tical association/​relationship with sale price.
I mentioned that the term “unlikely” needs a precise definition. In statistics,
something is unlikely if it happens 5 times or less out of 100 in repeated sam-
pling. Conversely, it is likely to happen if it turns up more than 95 times out
of 100 in repeated sampling. Informally, we say that something is unlikely if it
has a 5% probability or less of occurring. In this sense, getting a sample regression coefficient of 3 402, as we did above, is unlikely if the true coefficient in the population is zero. Hence, we reject what is unlikely, i.e. H0, and get indirect
support for H1. In practice, thankfully, testing of hypotheses in statistical
research is less cumbersome than the above procedure. Without further ado,
please welcome significance testing!
Significance testing and t-​values. Above we claimed to have gotten support,
albeit indirect, for our H1 –​that there was an association between square
meter and sale price in the population. We could instead have said that our

regression coefficient is “statistically significant”, which means the same thing.
Indeed, this is the typical way of saying that a regression coefficient is “large
enough” to be present in the (unknown) population and perhaps to deserve
attention in a scholarly project.
In practice we divide the sample regression coefficient, b, by its standard
error (SEb) to find out if b is statistically significant. From Stata-​output 4.1
this yields 3 402.354/​293.647 ≈ 11.59. The result is called a t-​value, and all
statistics programs calculate it. Stata reports it under the heading “t”; in R it
is labelled “t value”. If the t-value is larger than 2 in absolute value, implying that b lies
outside of the region of acceptance for H0, we deem b to be “statistically sig-
nificant”.9 Since the present t-​value is 11.59, our b is statistically significant.
That is, we judge b to be larger than zero in the population in the same way as
we did when we calculated the region of acceptance for H0 and found that b
was on the outside of this region.
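The division can be verified directly in Stata's display calculator:

display 3402.354/293.6466    // the t-value reported in Stata-output 4.1 (about 11.59)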
Sometimes the t-value of b is smaller than 2 in absolute value, indicating that b lies within the acceptance region of H0. If so, we dare not reject H0, and we thus
keep it. We then normally conclude that the regression coefficient is not “large
enough” to warrant attention in the population.10 Or, as shorthand, we say
that b is “not statistically significant”. In Stata-​output 3.7 we saw that age
had a very tiny effect on hours of exercise; the b being –​0.03 and the t-​value
being –0.90. Since this t-value lies between –1.96 and 1.96 (i.e. within the region of acceptance for H0), we conclude that the effect of age is not significant (or insignificant).
Significance levels and t-​values. So far the presentation has implied that we
discard H0 if it has a 5% or smaller probability of occurring in repeated sam-
pling. In large samples (i.e. n > 120) this tallies with a t-​value of 2. Or 1.96
to be exact. That is, a t-​value of 1.96 parallels a 5% significance level for b
in large samples. This 5% level is the default in the social sciences. Another
threshold is the 1% significance level. This corresponds to a t-​value of 2.6
in large samples. (The relationship between t-​values and significance levels is
inverse: The larger the t-​value, the lower the significance level. The larger the
t-​value, the less likely it is that H0 occurs by chance in repeated sampling.)
What, then, is a t-​value? For large samples t-​values are similar to the z-​values
we saw in Figure 4.2. These t-​values, to simplify, measure the distances in the
region of acceptance for H0 as depicted in Figure 4.3. The figure shows that
our t-​value of 11.59 is outside the 0 ± 2 t region, which is the 5% significance
level. It is also outside the 0 ± 2.6 t region for the 1% significance level. In other
words, our regression coefficient/​b is also statistically significant at the 1% level.
In small samples, t-​values must be larger than 1.96 to be deemed significant
at the 5% level. For, say, an x-y regression on 40 observations, the t-value must be at least 2.021 in order to reach the 5% level. Every book on statistics has a table
showing these t-​value/​significance level associations for various sample sizes.
Significance probability or p. On the right-​hand side of “t” in Stata-​output
4.1 we see the heading “P>|t|”. This column denotes the exact significance


[Figure: t-value axis from –3.5 to 3.5 with the regions of acceptance for H0 at the 5% and 1% significance levels; the sample t-value of 11.6 falls far outside both.]

Figure 4.3 Region of acceptance for the null hypothesis and its relationship with
t-​values.

probability (p for short), and not only whether a regression coefficient/b is significant below some threshold value like 5% or 1%. In our case this p is 0.000, which means p < 0.0001. The probability that what H0 suggests occurs by chance in repeated sampling is less than 0.01% (since 0.0001 as a proportion equals 0.01%).11
To summarize, we examine if a regression coefficient/​b is statistically sig-
nificant –​i.e. large enough to be present in the unknown population –​by
examining its t-​value or significance probability, which amounts to the same.
When we report the results of regression (see Chapter 10), we usually present
the significance probabilities. Several independent variables in the regression
do not complicate this, since each regression coefficient has its own t-​value
and significance probability (p).

4.6 Other critical issues regarding significance testing


A number of critical issues regarding significance testing are relevant for those
embarking on their first regression project. Here are some of them:
Statistically significant ≠ scientifically important. That a regression coeffi-
cient is significant at the 5% level means… just this. It does not mean that
the b in question is important in a substantial or practical way. Theory, prior
research and sound judgement must be used to assess whether or not the
actual “change in y associated with a one-​unit increase in x” is large, medium
or small in real life, or if it has policy-​relevant consequences. This point is also
related to the next one.
In large samples all regression coefficients become statistically significant. Since
larger samples by necessity lead to smaller standard errors (cf. the formula

for the standard error in section 4.4.), it follows that many tiny regression
coefficients are statistically significant in large samples (n > 1 000). From
this it also follows that we cannot use differences in significance levels to dis-
criminate among independent variables in terms of their relative practical
importance.
The null hypothesis is often stupid! In our example, H0 states that the regression
coefficient for square meter is zero. Given the knowledge of how the housing
market works this appears, well, at least questionable… To get around this, Stata has the capability to assess an alternative H0 – for example, that the population coefficient is 2 500 rather than zero (our sample b being 3 402). After the command yielding Stata-output 4.1, we simply write “test sq_m = 2500” in the command window
and press Enter. This yields:

. test sq_m = 2500
 ( 1)  sq_m = 2500
       F(  1,   323) =    9.44
            Prob > F =    0.0023

In other words, against the null that the b for square meter is 2 500, the
reported b is significant at the 0.23% level. My limited knowledge of R
prevents me from doing this analysis in that program, but I suspect it is pos-
sible. See Wickham and Grolemund (2016) for an introduction to statistical
analysis in R.
The 5% significance level is arbitrary. A whole issue of The American Statistician (2019, Vol 73, No S1) was recently devoted to the misuse of significance testing in much current practice. At center stage of the critical remarks was the tendency to view research findings as either significant/important (p < 0.05) or insignificant/non-important (p > 0.05). Such binary thinking
should be abandoned! Instead, I encourage students (with the support of a
reviewer of this book) to take a continuous, as opposed to binary, perspective
on significance testing.
Significance testing on non-​representative samples and populations. Recall the
introduction to this chapter: A sample generated by some random mechanism
lies at the core of significance testing as it is presented here. But many of the data
used by social scientists/​economists today are not samples in this sense, such as
convenience samples and populations. May we still apply significance testing in
the customary manner? Some are skeptical (Berk, 2004; Freedman et al., 1998),
and some are not (Hoem, 2008; Lumley, 2010). The prevailing practice seems to
be to treat the data, no matter how they were generated, as a miniature of some
imaginary population and, with this justification, carry on doing significance
testing in the traditional manner. For more on this, see Sterba (2009).
One-​sided and two-​sided tests of significance. The presentation so far has
referred to what is known as a two-​sided significance test, i.e. the default in



statistical research. This means that we, for large samples and a 5% signifi-
cance level, reject H0 if the t-​value is larger than 1.96 or smaller than –​1.96;
see Figure 4.3. But sometimes it is more prudent to employ a one-​sided test.
Let’s take another look at our square meter–​price regression. If H1 is that
square meter is positively related to sale price, we should only reject H0 if the
t-​value is above 1.96 on the right-​hand side of the figure (and not on the left-​
hand side of –​1.96). This is the one-​sided test. The upshot? If you have reason
to believe that that your x-​variable has a positive or negative effect on y, you
should employ a one-​sided test. If so, the actual significance level reported in
the statistics output is only half of what this number implies. That is, a signifi-
cance probability of, say, 0.099 (9.9%) from the statistics output means that
your regression coefficient is significant at the 5% level.

4.7 Summary and further reading


Regressions in the social sciences seek to know if and how x1 and x2 and so
on are associated with y. In most cases, however, while this question pertains
to a population, the analysis is done on a sample from this population. If
this sample is representative of the population, which requires some kind of
random sampling procedure, we use significance tests to find out if b1 and b2
and so on are large enough to warrant attention in the unknown population.
If they are, we claim the bs to be statistically significant. By this we mean
that the bs are of such magnitudes that it is unlikely that they have occurred by random chance. The purpose of this chapter has been to explain this endeavor in an accessible way, i.e. without going into many of the hard-to-grasp technicalities of this subject.
More could be said about hypothesis testing, but this should suffice for a
first regression project. For those wanting to learn more about the inferential
statistics touched upon in this chapter, consult for example Agresti and Finlay
(2014), Freedman et al. (1998) or Wheelan (2013). Especially the latter is a
user-​friendly introduction to statistics in general. For a summary of the cri-
tique against “bad” significance testing, see Wasserstein et al. (2019).
Within a regression framework, Berk (2004; Chapter 4) and Wooldridge
(2000; Chapters 4 and 5) cover much of the particulars in the present chapter.

Notes
1 Many types of data on which regression is used are not (random) samples in the
traditional sense, such as the soccer player data analyzed in section 2.4 and the
sausage data analyzed in section 2.6. More on this in section 4.6.
2 In practice, more complicated random draw procedures are usually used, but this
is beside the key point. See for example Freedman et al. (1998) in this regard.
3 This illustration is a simplification of the so-​called Central Limit Theorem (CLT).
For the CLT to work properly, the sample should be greater than 50 and the popu-
lation preferably much greater than 400. But the take-​home message of the CLT
is that a large sample (i.e. n > 120) from a well-​defined population drawn by some

random sampling mechanism will approximately resemble the population from
which it is drawn. For more on this in a non-​technical fashion, see Wheelan (2013).
4 Strictly speaking, the exact number is 1.96. But, as the saying goes, 2 is close
enough for government work.
5 The formula for computing the standard error is complicated and thus not some-
thing to be bothered with when we have statistics programs at our disposal. See for
example Fox (1997), Allison (1999) or Wooldridge (2000).
6 The interval differs from the interval under the heading of “95% Conf. Interval”
due to rounding, and the fact that Stata uses 1.96 in the calculation (see note 4).
7 The alternative hypothesis in (a) suggests a two-​sided significance test, whereas the
alternative hypothesis in (b) suggests a one-​sided significance test. See section 4.6
for more on this.
8 Testing the alternative hypothesis is a verification or confirmation approach;
testing the null hypothesis is a falsification approach. The latter is the dominant
one taken in statistical research.
9 Assuming a large sample, which in practice is a sample of approximately 120 units
or more. If the sample size is smaller than this, t-​values must be slightly larger
than 2.
10 A low t-​value could be the result of a small numerator (i.e. the regression coeffi-
cient/​b) or a large denominator (i.e. the standard error of b), but is in my experi-
ence most often the former.
11 Note that a low significance probability (p) for H0 is not the same as a high prob-
ability of H1 being true.


5 
The assumptions of regression analysis

5.1 Introduction
Chapter 4 explained how to get a confidence interval for a population regres-
sion coefficient (i.e. an unknown parameter) from a precise sample regression
coefficient (i.e. an estimate) provided that everything was well with the regres-
sion model in question. In this sense, Chapter 4 was an idealistic enterprise.
What, then, is the worst thing that can go wrong in a regression analysis,
given that the statistics program crunched the correct numbers and gave you
an output to consider? Probably that your all-​else-​equal regression estimate
of interest, b1, is nowhere close to its corresponding real-​life but unknown
population regression coefficient.1 How can we avoid this, i.e. avoid getting a biased (i.e. too high or too low) estimate of b1? By making sure our regression model
meets the assumptions of regression analysis. That is, provided the regression
lives up to the assumptions of regression analysis, we have sound reason to
believe that our b1 estimate is near in magnitude to its real-​life equivalent.
Allison (1999) stated 20 years ago that there was lots of confusion about the
assumptions of regression analysis. In my assessment this has not improved.
For example, some assumptions are less important than others, some are vital
in certain settings but not in others and some are, on closer inspection, not
assumptions at all. The aim of this chapter is therefore to explain and illus-
trate the regression assumptions in a more intuitive fashion than what appears
as the norm in most prior textbooks on the subject. The presentation assumes
that the focal point of the analysis is to get an unbiased all-​else-​equal estimate
of b1 that comes as close as possible to this coefficient’s real-​life but unknown
counterpart. That one also wants an unbiased all-​else-​equal estimate of b2 or
b3 with this property does not complicate matters –​at least not in principle.

5.2 The assumptions you should think about before doing your
regression and, quite often, before you choose your data (A)
A1 All relevant independent variables are included in the regression model.
This most important assumption of regression, given the aim of finding the
unbiased all-​else-​equal effect of x1 on y, is untestable in strict terms. When all

relevant independent variables are included in a regression, the estimate of
b1 will not change when a new independent variable is added to the model (but
the SE of b1 might). In Chapter 3, in contrast, we saw several examples of bs
changing in size when a new independent variable was added to the regres-
sion. Technically, we refer to this as omitted variable bias. Indeed, avoiding
omitted variable bias is the major motivation for doing multiple regression
instead of only bivariate regression. The point now is that it is always conceiv-
able that some variable that we do not have access to in our data, if this was to
be included in our regression, might bring about a change in the size of b1. In
this sense the value of b1 is always uncertain to some extent. This has one key
implication: Before embarking on a regression project aiming at finding the
unbiased effect of x1 on y, you should make sure your data include all relevant
independent (control) variables of y.2 In practice, though, since this typically is impossible, you must make do with what are hopefully the most relevant controls. And you must be prepared to argue why this is the case.
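A small simulation may make the danger of omitted variable bias concrete. The sketch below is entirely hypothetical (all numbers are invented): a lurking variable z affects both x1 and y, and leaving z out makes the estimate of b1 clearly too large (its true value is set to 2).

* Hypothetical illustration of omitted variable bias.
clear
set seed 42
set obs 1000
generate z  = rnormal()                 // the lurking (confounding) variable
generate x1 = 0.7*z + rnormal()         // x1 is correlated with z
generate y  = 2*x1 + 3*z + rnormal()    // z also affects y; true effect of x1 is 2
reg y x1      // omitting z: the estimate of b1 is biased upward
reg y x1 z    // including z: the estimate of b1 is close to 2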
A2 All non-​relevant independent variables are excluded from the regression
model. This assumption is often mentioned in textbooks, and this is why
I bring it up. In the real life of doing regression, though, it plays a minor
role. Indeed, if a regression contains one or two non-​relevant independent
(control) variables compared to five or six relevant ones, this generally has
no practical consequences for the independent variable of focal interest (cf.
Wooldridge, 2000). More importantly, many regressions aim towards finding
out whether or not a specific and perhaps policy-​relevant x1 has an impact
on y. If such an analysis shows that x1 has a minor and non-​significant effect
on y, this obviously does not make it “non-​relevant” in the regression model.
That is, a non-​finding could be an important finding in a number of cases.

5.3 The testable assumptions not requiring a random sample from a population-framework (B)
B1 The independent variables have additive effects on y. This means that it is
reasonable to think of the variation in y as a function of adding up changes in
a set of independent variables, x1 to xn. This assumption of additivity is ques-
tionable in some cases, and we return to it in section 6.2.
B2 The linearity assumption. Often, if not stated otherwise, regression equates
with linear regression. The linearity assumption states that x1 in the data, as
in real life, should have an approximately linear relationship with y. If the
relationship between x1 and y is non-​linear, the estimate of b1 will be biased
and measures must be taken to rectify this. These remedies will be addressed
in sections 6.3, 6.4 and 6.5. Linearity, as mentioned early in the book (cf.
Figure 2.4), means that the regression line is approximately straight.
How to test this? The most common approach is a visual inspection. Stata
has multiple options in this regard, and the one I prefer is the multivariable
scatterplot smoother (the general command is mrunning y x1 x2). The

mrunning routine is a user-​written command that may be downloaded from
the Internet (write “findit mrunning” in the command window of Stata,
press Enter and follow the instructions). The multivariable smoother puts no
a priori restrictions on the relationship between x1 (or x2) and y, but lets the
data speak for themselves in this regard. Beware that this kind of analysis is
computer-​intensive and that running it on large samples with several inde-
pendent variables takes much time.
Once again we return to the data on houses and their prices to illustrate.
The independent variables are square meter and distance to city center in
km. The regression appears in Stata-​output 5.1. Both coefficients are signifi-
cant with large t-​values, suggesting low probabilities of observing coefficients
of, respectively, 2 600 and –​23 000 if the true coefficients are zero in the
population, as H0 implies. That said, this regression takes for granted that
the two variables have linear relationships with sale price. Is this a tenable
assumption? Figure 5.1, the smoother, answers this directly in the sense that
the two regression lines summarize the relationship between the variables of
interest without imposing a linear relationship. Big-​ticket question: Do the
lines deviate from straight lines? No! It therefore makes good sense to think of these relationships as approximately linear.

Stata-output 5.1 Regression of house sale prices (price_Euro) on house sizes (square_m) and distance to city center (dist_city_c).

. use "C:\Users\700245\Desktop\temp_myd\housedata.dta"

. reg price_Euro square_m dist_city_c

Source | SS df MS Number of obs = 100
-------------+---------------------------------- F(2, 97) = 91.77
Model | 1.2871e+12 2 6.4356e+11 Prob > F = 0.0000
Residual | 6.8024e+11 97 7.0128e+09 R-squared = 0.6542
-------------+---------------------------------- Adj R-squared = 0.6471
Total | 1.9674e+12 99 1.9872e+10 Root MSE = 83743

------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
square_m | 2587.724 217.0158 11.92 0.000 2157.007 3018.44
dist_city_c | -23198.13 3836.473 -6.05 0.000 -30812.47 -15583.8
_cons | 155917.4 34448.68 4.53 0.000 87546.26 224288.5
------------------------------------------------------------------------------

What if the scatterplot smoother shows one or more clear deviations from
a straight line? Then we must turn to the tools available for us in sections 6.3,
6.4 and 6.5.
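If you cannot or do not want to install mrunning, a built-in (though bivariate, one x at a time) alternative is the lowess smoother, which likewise draws the x-y relationship without forcing it to be straight. A sketch using the house data analyzed above:

lowess price_Euro square_m       // smoothed curve for sale price by house size
lowess price_Euro dist_city_c    // smoothed curve for sale price by distance to city center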
B3 No perfect multicollinearity. Let’s say the earnings of a sample of employees are the dependent variable. Suppose further that the research question is to find
out how years of working experience and years of age explain variation in
these earnings. Assume further, for simplicity, that everybody in the sample
started to work at the firm in question right out of college, and that no one


Figure 5.1 House sale prices by square meter and distance to city center, with
multivariable scatterplot smoother lines. 100 houses sold in a small
Norwegian city during 2017.
Graph generated in Stata.

has had any spells of absence. If we were to do the regression earnings by working experience and age, would we get sensible all-else-equal estimates
of the regression coefficients of interest? The answer is no. The reason is
that once an employee gets one more year of working experience, he or she
also becomes one year older and vice versa –​with no exceptions! Put differ-
ently, there is no unique variation in either working experience or age that the
regression may use to decompose the variation of earnings into.3 This is the
extreme case of perfect multicollinearity.
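A tiny simulated example shows what perfect multicollinearity looks like in practice. The sketch is hypothetical (all numbers invented): experience is defined as age minus 25 for every employee, so the two variables are perfectly collinear and Stata simply drops one of them from the regression.

* Hypothetical illustration of perfect multicollinearity.
clear
set seed 7
set obs 100
generate age        = 25 + floor(36*runiform())    // ages from 25 to 60
generate experience = age - 25                     // perfectly collinear with age
generate earnings   = 20000 + 1000*experience + rnormal(0, 5000)
reg earnings experience age    // Stata omits one of the two collinear variables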
Perfect multicollinearity is typically a thought experiment. Yet the problem
remains for independent variables that are “highly collinear”, which in plain
speak means that two (or more) independent variables share a high correl-
ation; cf. the overlap in Figure 3.4. (Correlations range from –​1 to 1 in size,
and –​1 and 1 are perfect correlations.) Unfortunately, however, there is little
consensus as to what constitutes a “too high” correlation. Some have no
quarrels with making the all-​else-​equal interpretation of b for an independent
variable that shares a correlation of ± 0.80 with another independent variable

in the model. Others, in contrast, are reluctant to do so for a ± 0.60 correlation.
Typically, the correlations in question are expressed as VIF-​scores (variance
inflation factors), and every statistics package computes them. Stata-​output
5.2 extends the regression reported in Stata-​output 5.1 and shows the VIF-​
scores for the independent variables. The command “vif” is written immedi-
ately after the regression command in the command window.

Stata-output 5.2 Regression of house sale prices on a set of independent variables and a test of VIF-scores.

. reg price_Euro square_m dist_city_c renovate sec_living

Source | SS df MS Number of obs = 100
-------------+---------------------------------- F(4, 95) = 46.48
Model | 1.3021e+12 4 3.2551e+11 Prob > F = 0.0000
Residual | 6.6530e+11 95 7.0031e+09 R-squared = 0.6618
-------------+---------------------------------- Adj R-squared = 0.6476
Total | 1.9674e+12 99 1.9872e+10 Root MSE = 83685

------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
square_m | 2517.588 238.267 10.57 0.000 2044.568 2990.607
dist_city_c | -22426.32 3898.151 -5.75 0.000 -30165.12 -14687.51
renovate | 25259.94 17293.2 1.46 0.147 -9071.401 59591.28
sec_living | 1688.684 22708.81 0.07 0.941 -43394 46771.36
_cons | 151005.1 35092.74 4.30 0.000 81337.17 220673
------------------------------------------------------------------------------

. vif

Variable | VIF 1/VIF
-------------+----------------------
square_m | 1.21 0.827586
sec_living | 1.18 0.848759
renovate | 1.06 0.940088
dist_city_c | 1.03 0.966284
-------------+----------------------
Mean VIF | 1.12

The new independent variables in the regression are whether a house has a second living room or not (sec_living) and whether a house is recently renovated or not (renovate). We note that both dummies have
insignificant effects on sale price. The VIF-​scores appear below the output
and range from 1.03 to 1.21. In most books on regression, we are told that
multicollinearity is a concern when VIF-​scores reach 10. I am skeptical about
such a high threshold value, and I am not alone. Berk (2004) is worried about
multicollinearity when VIF-​scores are around 5, and Allison (1999) worries
when they are larger than 2.5. In any event, multicollinearity is of no con-
cern in the present case.4
What if the VIF-​score for an independent variable is larger than, say,
6? First, we should be cautious in ascribing this variable’s b as the all-​
else-​equal effect. Second, this b’s standard error will be enlarged. For

this reason, the b will have a harder time reaching statistical significance.
(Remember: b/​SEb = t-​value.) What, then, can be done with “too high”
collinearity? Authors diverge in their recommendations. One solution is to
add more units to the regression (if possible), hoping that the new observations will lower the correlation between the independent variables causing the problem. Another solution is to create
an index of the independent variables in question (cf. section 2.5). A third
is to drop one or more of the independent variables causing the problem,
but this may be at odds with the A1 assumption. Finally, an old tune says
“do nothing” if the regression coefficients of interest are theoretically rea-
sonable and significant despite having large standard errors. It should also
be mentioned that “high correlations” between independent variables are a concern only for the independent variables in question, and that this has no
bearing on the regression coefficients for the other independent variables in
the regression model.
In short: Although only perfect multicollinearity is a strict violation of
the regression assumption, the more common “too high” collinearity problem
leads to inflated standard errors and an identification problem for the b of
interest: We cannot trust, even when all other regression assumptions are met,
that the regression coefficient is the unbiased all-​else-​equal effect.
The aforementioned regression assumptions are mentioned in every text on
the subject. The next one is typically not, but see Wooldridge (2000).
B4 No influential outliers. Take a look at Figure 5.2 (former Figure 1.1).
At least three of the houses in the graph, i.e. those pointed at by arrows,
are located a bit away from the other houses. In this sense they are poten-
tial outliers. Now, think of two y = b0 + b1x1 regressions. The first is run
on all the houses in the data; the second on all houses minus the potential
outliers. If these regressions yield the same estimates of b1, you do not have
an outlier problem. In contrast, if they produce dissimilar estimates of b1,
you have a problem. That is, your outliers are influential. What to do? In
some cases, say when outliers are few compared to the total number of
units, it could be ok to delete them. In others, say when analyzing a sample
representative of some population, deleting observations might cause your
sample to become non-​representative. This is generally not ok. There are
lots of diagnostic tools available for identifying outliers; see for example
Meuleman et al. (2015). In the house data above there is no influential out-
lier problem.
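The informal check described above is straightforward to carry out. In the sketch below the cutoff used to set aside the suspected outliers (here, houses larger than 300 square meters) is purely hypothetical and must be replaced by whatever identifies the suspicious observations in your own data:

reg price_Euro square_m                      // all houses
reg price_Euro square_m if square_m < 300    // suspected outliers excluded (illustrative cutoff)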
Summing up, if the assumptions in sections 5.2 and 5.3 are met, you could
be pretty sure that your b1 of main interest comes close to its unknown, real-​
life counterpart. But often, if not most often, your data are a random sample from a well-defined population. In this case we need some additional
assumptions to make valid inferences from the sample b1 to its population
counterpart. This is the topic of section 5.4.


Figure 5.2 House sale prices by house sizes. 100 houses sold in a small Norwegian city
during 2017.

5.4 The additional testable assumptions for a random sample from a population-framework (C)
The error term, e, was mentioned in section 2.7. This e is an unknown, real-​
life population variable or parameter. The present regression assumptions
deal with this e and say it should be homoscedastic, normally distributed
and uncorrelated. When examining these assumptions empirically, however,
we test the residual mentioned in section 2.6. (The residual is a sample vari-
able you may compute after every regression you run on your computer.)
We address the properties of the residuals in turn. At the outset it should be
emphasized, however, that violations of any of these three assumptions have
no consequences for the size of the estimate of the b1 of interest.
C1 Homoscedastic residuals. Although this assumption sounds difficult,
it is not. In plain English, the assumption states that the spread of the
observations around the regression line should be of equal size for the different
values of x1 (and x2 and so on). Note the regression in Figure 5.3 (former
Figure 2.5): Performance correlates positively with earnings, as indicated
by the upward-​sloping regression line. But note also the spread in earnings
around this regression line, which is larger for better performance.5 In short,

the spread is heteroscedastic (i.e. unequal) and a violation of the assumption of homoscedasticity (i.e. equal).

Figure 5.3 Earnings per year by player performance, with regression line. 242 players in the Norwegian version of the Premier League. Graph generated in Stata.
We normally check this assumption by a residuals versus predicted
values plot. After running the regression in Stata, you write “rvfplot,
yline(0)” in the command window and press Enter. The result appears in
Figure 5.4. The horizontal line is a transformed regression line. We make the
same conclusion as before –​the residuals are heteroscedastic.
Sometimes we address the question of homoscedasticity versus
heteroscedasticity using a statistical test, much in the same way we used signifi-
cance tests in Chapter 4. Immediately after the regression command, we may
in Stata write “estat hettest” and press Enter in the command window.
This yields:

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of EarnMill
chi2(1) =   26.87
Prob > chi2 = 0.0000


Figure 5.4 Earnings per year by player performance, residuals versus predicted values
plot. 242 players in the Norwegian version of the Premier League.
Graph generated in Stata.

We firmly reject the null hypothesis of homoscedasticity; p < 0.0001. There is no need to get into the technicalities of this test (cf. Wooldridge, 2000).
What is the consequence of heteroscedasticity? The answer is that the esti-
mate of b1’s standard error (SE) will be biased, i.e. too high or too low. This,
in turn, affects the significance test. (Recall that b/​SEb = t-​value.) If the SE of
b1 in your statistics output is biased downwards, then the true t-​value is lower
than what the output suggests. Therefore, you might claim b1 to be significant
while this in fact might not be the case.
What to do about heteroscedasticity? In recent years, and provided access
to a comprehensive statistics package, the typical remedy has been to com-
pute robust SEs. These SEs are, in other words, correct even in the presence
of heteroscedasticity. Stata-​output 5.3 illustrates for the performance-​
earnings regression. Whereas the normal SE is 11 794 (cf. Stata-​output 2.8),
the robust SE is 12 776. (To get robust SEs in Stata you just add “, rob”
to the regression command.) Using a robust SE has no bearing on the sig-
nificance level in this case; in others the t-​value may “plummet” from > 1.96
to < 1.96. In section 6.5 we take a look at another solution for dealing with
heteroscedasticity.

Stata-​output 5.3 Regression of earnings (earnings_​E) on player performance
(performance), with robust standard error (SE).

. use "C:\Users\700245\Desktop\temp_myd\football_earnings.dta"

. reg earnings_E performance, rob

Linear regression Number of obs = 242


F(1, 240) = 24.45
Prob > F = 0.0000
R-squared = 0.1068
Root MSE = 90054

------------------------------------------------------------------------------
| Robust
earnings_E | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
performance | 63176.56 12775.92 4.94 0.000 38009.31 88343.81
_cons | -173364 56597.15 -3.06 0.002 -284854.6 -61873.36
------------------------------------------------------------------------------

C2 Normally distributed residuals. This assumption is also mentioned in every textbook on regression, but it is only relevant for small samples. The
assumption states that if we graph the distribution of the residuals, akin to
the distribution of m2 for the 325 apartments in Figure 4.1, they should be
normally distributed. Figure 5.5 shows the distribution of the residuals for

the performance-earnings regression. The solid line shows the normal distribution, and the dashed line the distribution of the actual residuals from the regression model. Clearly, the assumption of normality is violated.

Figure 5.5 Earnings per year by player performance, normal probability plot of residuals. 242 players in the Norwegian version of the Premier League. Graph generated in Stata.
(The general code in Stata to obtain the graph in Figure 5.5 is first to do the
regression in the usual manner: “reg y x1 x2”. Then you write “predict
r, resid” in the command window and press Enter. This makes Stata com-
pute the residuals, here denoted r. Then you write “kdensity r, normal”
in the command window and press Enter again to produce the graph.)
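(Put together, and using the performance-earnings regression from this chapter as an example, the sequence is simply:

. reg earnings_E performance
. predict r, resid
. kdensity r, normal

This is only a restatement of the commands just described, not new output.)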
Often the normality of residuals-​assumption is tested formally by means of
a significance test. After running the regression and computing the residuals
(i.e. naming them r), we write “swilk r” in the command window and press
Enter. This yields:

. swilk r

Shapiro-Wilk W test for normal data

    Variable |        Obs       W          V         z       Prob>z
-------------+-------------------------------------------------------
           r |        242    0.84545    27.239     7.676    0.00000

The significance probability on the right-​hand side suggests that the null
hypothesis of normality of the residuals is clearly rejected (p < 0.000001).
There is no need to go into the technicalities of this test and related ones.
What is the consequence of the normality of residuals-​ violation? As
mentioned, in large samples, the Central Limit Theorem ensures that non-​
normal residuals are unproblematic. For small samples, in contrast, non-​
normality of the residuals might entail that we cannot trust the tests of
significance (i.e. p values) to be accurate.
C3 Uncorrelated residuals. This assumption is foremost a question of how
the data were generated. If your data is a random sample of people having
nothing in common other than randomly being drawn out of a bucket, you
have little to worry about. That is, this assumption means that what person
A answers in his questionnaire is uncorrelated with (i.e. independent of)
what B answers in his or hers. But let’s say ten random pupils in each class
at an elementary school answer a questionnaire, and that the y-​variable is
an assessment of the pupils’ learning environment on a 1 to 10 scale. Will
the answers to this question be uncorrelated among the school’s pupils?
Probably not, since pupils in the same class share the same learning environ-
ment. Also, consider the earnings of the soccer players previously analyzed
in this chapter. Are these uncorrelated among the players? Probably not,
since the earnings of players on the same team are dependent on each
other due to a fixed salary budget. These two exceptions aside, correlated
residuals are under normal circumstances not something to worry about when analyzing cross-sectional data. When the data are longitudinal,
such as repeated observations for the same person, firm, stock option or
country, the problem of correlated residuals is pervasive and must be dealt
with accordingly.6
What is the consequence of correlated residuals within a cross-​sectional
framework? As before the problem is that the SE of b1 gets biased downwards.
That is, the statistics output will report a SE which is smaller than it really is.
For this reason, one might conclude that b1 is significant when it is not. The
solution is to calculate a SE which takes the clustering of observations (in
classes, in teams etc.) into account. This cluster-​robust SE appears in Stata-​
output 5.4 for the soccer players’ earnings regression (i.e. 22 727). It is much
larger (indicating less precision) than the normal SE (11 794; Stata-​output
2.8) and the robust SE (12 776; Stata-output 5.3). For this reason, the significance probability (p value) of the regression coefficient for player performance is much higher than in previous analyses (p = 0.014).
To obtain a cluster-​robust SE in Stata, you add “, cluster(variable
name)” to the usual regression command. (Here, the variable club_​rank
specifies that a cluster-​robust SE taking the 16 different teams in the data into
account should be calculated.)

Stata-output 5.4 Regression of earnings (earnings_E) on player performance (performance), with a cluster-robust standard error.

. reg earnings_E performance, cluster(club_rank)

Linear regression Number of obs = 242


F(1, 15) = 7.73
Prob > F = 0.0140
R-squared = 0.1068
Root MSE = 90054

(Std. Err. adjusted for 16 clusters in club_rank)


------------------------------------------------------------------------------
| Robust
earnings_E | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
performance | 63176.56 22727.45 2.78 0.014 14734.15 111619
_cons | -173364 98449.44 -1.76 0.099 -383204 36476.04
------------------------------------------------------------------------------

5.5 Summing up the regression assumptions, some other issues and further reading
Summing up the (linear) regression assumptions, we have:

A The untestable assumptions (in strict terms)


A1 All relevant independent variables are included in the regression model
A2 All non-relevant independent variables are excluded from the regression model


B The testable assumptions not requiring a random sample from a population-framework
B1 The independent variables have additive effects on y (if violated, see
section 6.2)
B2 The linearity assumption (if violated, see sections 6.3, 6.4 and 6.5)
B3 No perfect multicollinearity
B4 No influential outliers

C The additional testable assumptions for a random sample from a population-framework
C1 Homoscedastic e (tested on residuals)
C2 Normally distributed e (tested on residuals)
C3 Uncorrelated e (tested on residuals)

There is lots of confusion about the linear regression assumptions, as stated in the introduction to this chapter. This state of affairs is also in part due
to differences in terminology. When assumptions A1, A2 and B1 are met,
for example, this often goes by the name of a “correctly specified regression
model” in many textbooks. Conversely, if these assumptions are not met, the
model is misspecified. You will also meet the zero conditional mean, or mean
independence assumption, in many books. This states that the average of the
error term, e, should be zero (see section 2.7). If the assumptions mentioned
above are met, this zero conditional mean will generally be the case.
Another regression assumption sometimes mentioned is that of no meas-
urement error in the variables. Generally, the poorer a concept
(as operationalized by a certain variable) is measured, the more imprecise its
regression coefficient becomes. However, in most cases we have the data we
have, and little can be done about this. If we collect our own data, in contrast,
we may improve the measurement of variables by creating indexes based on
several variables measuring the same (underlying) concept.
A final issue concerns normality. In some books you will meet the
“assumption” that y should be normally distributed. This, for all intents and
purposes, is nonsense. If this indeed was a regression assumption, 90% or so
of all regressions would be invalid. Berry (1993) provides a thorough discus-
sion of the linear regression assumptions; see also Allison (1999; Chapters 3,
6 and 7) and Wooldridge (2000; Chapters 3 and 8).

Notes
1 This applies to cases where the data are random samples from a well-​defined popula-
tion. In cases where they are not, which increasingly is the case in economics and the
social sciences, the aim of the regression would be to get the unbiased real-​life and
all-​else-​equal regression coefficient without any reference to a specific population.
2 If you collect your own data, for example through a questionnaire or through
detective work on the Internet, you have in general more to say about control vari-
able completeness than when using datasets gathered by others.

3 Say we have the regression y = b1x1 + b2x2. To identify the all-​else-​equal effects
of x1 and x2, as expressed by b1 and b2, we need some observations for which x1
does not change when x2 changes or vice versa. In other words, in the presence of
multicollinearity, it is impossible to increase x1 by one unit while holding x2 con-
stant (Berry, 1993).
4 The column “1/​VIF” contains the so-​called tolerance values, which some prefer to
VIFs. A VIF value of 2.5 is thus a tolerance value of 0.4.
5 We assume that these data pertain to a random sample of soccer players, although
we know for a fact they actually pertain to a population of such players.
6 The typical test for serial correlation is the Durbin-Watson test (see
Wooldridge, 2000).

Exercises with data and solutions


The following exercises build on the exercises from Chapter 3.

Exercise 5A
Use the housedata.dta.
5A1: Estimate the following regression model:
price_​Euro = sec_​living
What does this bivariate regression tell you regarding statistical significance?
5A2: Estimate the following regression model:
price_Euro = sec_living + renovate
What does this multiple regression tell you regarding statistical significance?
5A3: Estimate the following regression model:
price_Euro = sec_living + renovate + square_m
(i) What does this multiple regression tell you regarding statistical
significance?
(ii) What does this multiple regression tell you regarding multicollinearity,
homoscedasticity, normally distributed errors and uncorrelated
errors? (We already know, from Figure 5.1, that square_​m has a
linear effect on price.)

Exercise 5A, solution


5A1:
The t-​value for the regression coefficient is 3.17, and the significance prob-
ability (p) is 0.002. The regression coefficient for sec_​living is thus signifi-
cant by conventional standards. A regression coefficient of 107 070 is, in other
words, very unlikely to appear in repeated sampling if the true regression coef-
ficient in the population is zero.
5A2:

a.indd 94 27-Aug-19 15:04:21


95

The assumptions of regression analysis 95


The t-​values for the regression coefficients are 3.20 (sec_​living) and
2.86 (renovate), and the significance probabilities are, respectively, 0.002
and 0.005. Both coefficients are thus significant by conventional standards.
Both coefficients are so large that it is very unlikely that they will come up in
repeated sampling given true regression coefficients of zero in the population.
5A3:
(i) The t-​value for the regression coefficient of sec_​living is 0.70,
and the significance probability (p) is 0.485. This b is thus not sig-
nificant by conventional standards. A regression coefficient of 18
225 is, in other words, likely to appear in repeated sampling if the
true regression coefficient in the population is zero. The regression
coefficient of renovate is borderline significant (t-​value = 1.97; p
value = 0.052). (Note that if we employ a one-​sided test of signifi-
cance, which here seems justified, the p value becomes 0.026.) The
coefficient for square_​m is significant by conventional standards,
as to be expected.
(ii) Multicollinearity is of no concern with no VIF-​scores being larger
than 1.21 and a mean VIF-​score of 1.14. Testing for heteroscedasticity,
we cannot reject the null hypothesis of homoscedasticity (p = 0.929).
The errors are for all intents and purposes normally distributed.
Whether or not the errors are uncorrelated – meaning whether the sale price of one house is independent of the sale price of another house –
is difficult to have an informed opinion about. It might be the case
that houses situated in the same neighborhood sell for more/​less than
“similar” houses located elsewhere simply due to their favorable/​
unfavorable location. If this is the case in general, something I doubt
here, the errors might be correlated.

Exercise 5B
Use the data football_​earnings.dta.
5B1: Estimate the following regression model:
earnings_​E = games_​int
What does this bivariate regression tell you regarding statistical significance?
5B2: Estimate the following regression model:
earnings_E = games_int + age

(i) What does this multiple regression tell you regarding statistical
significance?
(ii) What does this multiple regression tell you regarding linearity,
multicollinearity, homoscedasticity and normally distributed errors?
(We already suspect, from section 5.4, that the errors probably are
correlated.)

Exercise 5B, solution
5B1:
The t-​value for the regression coefficient is 6.36, and the significance prob-
ability (p) is < 0.0001. The b for games_​int is thus significant by conven-
tional standards. A regression coefficient of 3 044 is, in other words, very
unlikely to appear in repeated sampling if the true regression coefficient in the
population is zero.
5B2:
(i) The t-​values for the regression coefficients are 5.18 (games_​int)
and 3.21 (age), and the significance probabilities are, respectively,
< 0.0001 and 0.002. Both coefficients are thus significant by conven-
tional standards. Both coefficients are so large that it is very unlikely
that they will come up in repeated sampling given true regression
coefficients of zero in the population.
(ii) Using the multivariable scatterplot smoother (Figure 5.1), we see that
none of the independent variables have a linear effect on earnings.
The remedy for this is covered in sections 6.3, 6.4 and 6.5. With no
VIF-​scores being larger than 1.10, and a mean VIF-​score of 1.10,
multicollinearity is of no concern. Testing for heteroscedasticity, we
firmly reject the null hypothesis of homoscedasticity (p < 0.00001).
One possible solution to this problem is covered in section 6.5. The
errors are nowhere close to being normally distributed, but this is
of no concern since the analysis is based on 242 players (i.e. the
“sample” is large).

Informal glossary to Part 2


Alternative hypothesis:
The hypothesis we actually believe in regarding the relationship between x
and y. In most cases this implies that there should be some kind of statistical
effect of x on y.
Assumptions (of regression analysis):
Collective term for the requirements that should be met in order to obtain
trustworthy results of a regression analysis (i.e. the topic of Chapter 5). These
assumptions are summed up in section 5.5.
Confidence interval (or region):
An interval delineating the minimum and maximum size of the regression
coefficient/​b in the population, in light of repeated sampling (see below).
Normal distribution:
A frequency distribution with a bell-​shaped form that plays a vital role in
inferential statistics.



Null hypothesis:
The hypothesis we do not believe in regarding the relationship between x and
y, but which we actually test. In most cases the null hypothesis suggests no relationship between x and y.
Population:
All units/​observations of something, as in all adults in the USA or all SUV
cars in the world.
Random sample:
A sample (see below) taken from a population according to some random
draw procedure (e.g. like a lottery).
Repeated sampling:
For all intents and purposes a thought experiment in which (random) samples
are drawn repeatedly from some population and where some kind of stat-
istic (e.g. a mean or a regression coefficient/​b) is calculated for each new
sample drawn.
Sample:
A fraction or part of some larger population (see above).
Significance test (of regression coefficient/​b):
A broad term encompassing the test procedure to deem whether or not a
regression coefficient/​b in a sample is sufficiently different from zero, i.e. what
the null hypothesis suggests. If the test answer is yes, the regression coefficient/​
b of x is said to have a significant effect on y in the population from which the
sample is drawn.
Standard error (of regression coefficient):
A figure referring to the precision of which the regression coefficient/​b is
measured. The lower the standard error, the more precision in the meas-
urement. More precision, in turn, yields a narrower confidence interval (see
above) for the regression coefficient/​b.
t-​value (of regression coefficient/​b):
A benchmark used in a significance test (see above). In large samples (n >
120), a t-​value of 2 (or 1.96 to be exact) suggests that a regression coefficient/​
b is significant. (Regression coefficient/​b divided by the standard error of the
regression coefficient/​b = the t-​value.)


Part 3

The extensions

In Part 3, "The extensions", you will get to know:

• The difference between additivity and non-additivity, and how to model the latter
• The difference between linear and non-​linear effects, and how to model
the latter
• How mediation models enrich traditional regression models
• How to use regression analysis on categorical (i.e. binary, unordered or
ordered) dependent variables
• How to implement instrumental variable (IV) regression models aiming
to reveal causal effects


6 Beyond linear regression
Non-additivity, non-linearity and mediation

6.1 Introduction
Sometimes the complexities of life entail that the linear regression model
paints a too simplistic picture of reality. If under these circumstances we hold
on to the linear model, the estimate of b1 for x1 on y could be biased. The pur-
pose of this chapter is to illustrate the ways in which we may adapt the linear
regression model to these more complicated circumstances. We first address
the violations of assumption B1 (section 6.2) and B2 (sections 6.3 to 6.5).
Finally, section 6.6 takes up the issue of mediation which will be explained
when we get there.

6.2 Non-additivity (assumption B1): Interaction/moderator regression models
In all the y = b0 + b1x1 + b2x2 regressions in this book so far, it has been tacitly
assumed that the various regression coefficients, or effects, are of similar size for
every subgroup of observations. Indeed, this is what the additivity assumption
(B1) means. And without a theory or common sense telling us anything to the
contrary, this is a sound point of departure for statistical analysis. In some
cases, however, we have reason to suspect that this assumption of “constant
bs” does not sit well with reality. Moreover, we may even expect that the size
of b1 is larger or smaller for subgroup A than it is for subgroup B in real life.
Below we learn how to examine if this is the case and how to deal with it.
To set the stage, consider a regression between earnings (y) and years of
education (x1) and gender (x2) for a random sample of 300 employees. (We
omit other x-​variables for simplicity, but without loss of generality.) Let’s
say b1 is 1 000, meaning that one more year of education entails 1 000 more
in earnings given a dynamic interpretation, controlling for gender. So far, so
good. The additivity assumption –​i.e. the constant bs –​now says that years
of education’s effect on earnings is the same for men and women. But is it
necessarily so in real life? The short answer is no. The longer answer is that
research has shown that women, as compared with men, benefit more in terms

of earnings from investing in one more year of education. Graphically we may see the pattern in Figure 6.1. Note that although women earn less than men on average (the solid line is lower on the y-axis), the relationship between education and earnings is steeper (i.e. has a larger b1) for women. That is, the additivity assumption (B1) is violated. Additivity thus means parallel regression lines.

Figure 6.1 Stylized relationship between earnings (y) and years of education (x1) for men (dashed line) and women (solid line).
How do we unearth relationships of the kind depicted in Figure 6.1? An
obvious solution is to do two separate regressions: one for women and one
for men. This procedure has one drawback, however, as it requires a rather
large number of observations. The more common procedure is to estimate
an interaction or moderator regression model. (Interaction and moderator
regression models are terms used interchangeably in the literature.) The pro-
cedure involves first creating, and then adding, an interaction variable to the
regression model. Here’s an example on how to do it.
We use a new dataset (stud_​phys_​17): a random sample of students who
have answered a questionnaire. The y is weekly hours of exercise (phys_act_
h), and the x-​variables are gender (women = 1, men = 0) and the students’
friends’ exercise levels (friend_​phys; a 0 to 4 scale where 0 denotes inactive
friends and 4 very active friends).1 We hypothesize that men and students
with physically active friends spend more time exercising themselves than
do women and students with inactive friends. Stata-​output 6.1 presents the
results for the standard, additive model.
The additive model. The output says that men exercise for 1.4 more hours per
week than women ceteris paribus (remember the coding of gender), and
similarly that a one-unit increase in friends' exercise levels entails 0.6 hours more exercising per week in a dynamic sense. Both coefficients are statistically
significant at p < 0.0001, suggesting that both alternative hypotheses (i.e. one
for gender, and one for friends’ activity levels) are indirectly supported.

Stata-output 6.1 Regression of hours of exercise per week (phys_act_h) on gender (gender) and friends' exercise levels (friend_phys).

. use "C:\Users\700245\Desktop\temp_myd\stud_phys_17.dta"

. reg phys_act_h gender friend_phys

Source | SS df MS Number of obs = 657


-------------+---------------------------------- F(2, 654) = 23.58
Model | 456.164947 2 228.082473 Prob > F = 0.0000
Residual | 6325.32181 654 9.67174589 R-squared = 0.0673
-------------+---------------------------------- Adj R-squared = 0.0644
Total | 6781.48676 656 10.3376323 Root MSE = 3.1099

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -1.415949 .2547573 -5.56 0.000 -1.91619 -.9157081
friend_phys | .5927305 .1633656 3.63 0.000 .271946 .9135149
_cons | 4.125651 .4679033 8.82 0.000 3.206877 5.044425
------------------------------------------------------------------------------

This additive regression tacitly assumes that the regression coefficient for friends' exercise levels (0.593) is of the same size for men and women.
But let’s suppose some researcher has published a paper showing that these
coefficients are of different sizes. To examine this, we use a mechanistic pro-
cedure available in every statistics program. We first create an interaction term
(i.e. a product variable) between the two variables of interest. We then add this
interaction term to our initial regression model in Stata-​output 6.1. Stata-​
output 6.2 illustrates. Before the regression command, Stata is told to gen-
erate the variable wom_​f_​p, i.e. the product of gender and friend_​phys.
(In Stata commands * = ×.) This product term variable is then added to the
regression command in the usual way.2
The interaction model. The key to interpreting interaction models with a
dummy is to recall how the dummy is coded, i.e. 1 = women and 0 = men in
this case. The point is that the regression coefficient of interest (friend_​
phys) refers to the subgroup with the value of 0 on the dummy. That is,
for men the effect of friends’ exercise levels on hours of exercise is 1.09. For
women, in contrast, the analogous effect is 0.30, i.e. the effect of friend_​
phys (1.09) plus the effect of the product term wom_​f_​p (–​0.79; 1.09–​
0.79 = 0.30).3 Conclusion: The effect of friends’ exercise levels on students’
hours of exercise is larger for men than for women. The coefficient for
gender refers to students with 0 on the friends’ exercise levels variable.
The difference between the genders in exercise hours for this group is about
half an hour (0.58) in women’s favor, but this difference is not significant at
conventional levels.
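If you also want the women's slope (≈ 0.30) reported with its own standard error and confidence interval, Stata's lincom command can be used right after the regression in Stata-output 6.2. A minimal sketch (the lincom output itself is not shown here):

. reg phys_act_h gender friend_phys wom_f_p
. lincom friend_phys + wom_f_p    // slope of friends' exercise levels for women, about 0.30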



Stata-​output 6.2 Regression of hours of exercise per week (phys_​act_​h) on gender
(gender) and friends’ exercise levels (friend_​phys) and their
product variable (wom_​f_​p).

. generate wom_f_p = gender * friend_phys

. reg phys_act_h gender friend_phys wom_f_p

Source | SS df MS Number of obs = 657


-------------+---------------------------------- F(3, 653) = 17.64
Model | 508.47031 3 169.490103 Prob > F = 0.0000
Residual | 6273.01645 653 9.60645704 R-squared = 0.0750
-------------+---------------------------------- Adj R-squared = 0.0707
Total | 6781.48676 656 10.3376323 Root MSE = 3.0994

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .5774496 .8912158 0.65 0.517 -1.172545 2.327444
friend_phys | 1.087477 .267327 4.07 0.000 .5625532 1.612402
wom_f_p | -.786476 .3370497 -2.33 0.020 -1.448308 -.124644
_cons | 2.851303 .7181326 3.97 0.000 1.441175 4.26143
------------------------------------------------------------------------------


Figure 6.2 Visual illustration of the interaction effect between gender and friends’
exercise levels, based on the regression model reported in Stata-​output 6.2.

Interaction effects are more easily understood and interpreted when presented in graphs. Figure 6.2 is based on the regression in Stata-output
6.2.4 Note the steeper regression line for men, suggestive of a larger regression
coefficient for this subgroup. The figure also shows, on the left-​hand side, the
small gender difference for low values on the friends’ exercise levels variable.
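A graph along the lines of Figure 6.2 can be produced in several ways; the sketch below is one possibility, not necessarily how Figure 6.2 was made. It uses Stata's factor-variable syntax together with margins and marginsplot; the ## operator creates the product term behind the scenes, so this re-estimates the same interaction model:

. reg phys_act_h i.gender##c.friend_phys
. margins gender, at(friend_phys=(0(1)4))    // predicted hours for men and women at each level
. marginsplot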



One issue remains. Why did we conclude that there indeed was an inter-
action effect to speak of ? The answer is that the regression coefficient for the
product term (wom_​f_​p) had a significant effect on hours of exercise. This is
also the general idea. When an interaction variable (or product term) has a
significant effect, we conclude that the simpler additive model is inadequate
and gain support for the non-​additive model (i.e. interaction model). In con-
trast, when a product term has an insignificant effect, we keep the simpler,
additive model.5 This is Occam’s razor or the principle of parsimony in prac-
tice: We prefer simpler models to more complicated ones when we have no
reason to suspect that the simpler models are erroneous. In this case we had
such reason.
Let’s take stock: Sometimes reality bites in the sense that a regression coef-
ficient in real life is of varying size for different subgroups of observations. If
we then force our regression to be additive (i.e. commit a so-​called specifica-
tion error), we get a mismatch between reality and the regression, causing our
regression coefficient to be biased. To rectify this, we use an interaction/​mod-
erator regression model that allows a regression coefficient to be of unequal
size for different subgroups in the data. When do we test for an interaction
effect? When there is a theoretical or common-​sense reason to suspect that an
additive regression model is a too simple representation of reality.6

6.3 Non-​linearity 1 (assumption B2): Quadratic regression models


Non-​linearity concerns how the independent variable of interest, x1, is related
to y.7 The problem is that this relationship is not linear, as in the stylized rela-
tionship between x1 and y in Figure 6.3. (We now assume for simplicity, as
we also do in sections 6.4 and 6.5, that our regression is additive, i.e. that no
interaction terms are necessary.) How do we adapt the regression model to the
non-​linear scenario in Figure 6.3?
Thankfully there is also a very straightforward and mechanistic solution to
this problem. The strategy involves adding a quadratic or squared term to the
regression. (The square of x is x multiplied by x, i.e. x^2.) By doing this, we
create an up-​and-​then-​down regression line like the one in Figure 6.3.
Below is an example using the soccer player data. The y is again yearly
earnings, and the independent variable is age. (More independent variables do
not cause additional difficulties.) Stata-​output 6.3 shows the computation and
results. The quadratic term is first created and then added to the regression
in the usual manner. The key to interpretation lies in the sign of the bs. If the
original variable (age) has a positive b and its quadratic (agesquared) a
negative b, we have the pattern depicted in Figure 6.3. This implies an inverse
U-​relationship between age and earnings: Players earn more as they age in
the beginning of their careers, but at some point their earnings reach a top
point and then start to decline with aging. This maximum point is calculated
by the formula –​b1/​(2 × b2), where b1 here refers to age and b2 refers to
agesquared. We thus have –​45 292/​(2 × –​726) = 31.19; the players reach
their top earnings potential at 31 years of age.
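The turning point can also be computed directly from the stored coefficients right after the regression (a minimal sketch; _b[ ] refers to the coefficients Stata has just estimated):

. reg earnings_E age agesquared
. display -_b[age]/(2*_b[agesquared])    // maximum point of the inverse U, roughly 31 years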


Figure 6.3 A non-​linear relationship (i.e. an inverse U-​relationship) between x1 and y.

Stata-output 6.3 Regression of earnings (earnings_E) on player age (age) and square of player age (agesquared).

. use "C:\Users\700245\Desktop\temp_myd\football_earnings.dta", clear

. gen agesquared = age * age

. reg earnings_E age agesquared

Source | SS df MS Number of obs = 242


-------------+---------------------------------- F(2, 239) = 17.29
Model | 2.7544e+11 2 1.3772e+11 Prob > F = 0.0000
Residual | 1.9036e+12 239 7.9648e+09 R-squared = 0.1264
-------------+---------------------------------- Adj R-squared = 0.1191
Total | 2.1790e+12 241 9.0416e+09 Root MSE = 89246

------------------------------------------------------------------------------
earnings_E | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 45292.13 12193.37 3.71 0.000 21271.93 69312.33
agesquared | -726.4627 222.5748 -3.26 0.001 -1164.922 -288.0039
_cons | -556343.5 163076.3 -3.41 0.001 -877594 -235093.1
------------------------------------------------------------------------------

Let's take a moment. Why include a quadratic (squared term) in a regression in the first place? The answer is that we, based on theory, prior research or common sense, suspect that the relationship between x1 and y has an inverse
U-​form or U-​form in real life. (Prior research on the determinants of soccer
players’ earnings does suggest this inverse U-​relationship.) The evidence of
such a non-​linear relationship is, in short, a significant regression coefficient
for the quadratic term. We thus keep a significant quadratic term in the regres-
sion, since this suggests a non-​linear relationship between x1 and y in real
life. If the quadratic term is insignificant, in contrast, we have no case for

non-linearity and again prefer the simpler, linear model using Occam's razor. In our case, the b for agesquared is significant at p = 0.001, implying a non-linear relationship. Preferring a linear model over a non-linear one when there is evidence of non-linearity amounts to the same kind of specification error as holding on to an additive model when there is evidence of non-additivity (see section 6.2). Figure 6.4 contrasts the linear regression (solid regression line) with the quadratic regression (dashed regression line).8

Figure 6.4 Comparison of linear and quadratic (i.e. non-linear) regression between age and earnings. 242 players in the Norwegian version of PL. Graph generated in Stata.
When the effect of x is positive and the effect of x^2 is negative, the relationship between x and y has an inverse U-form. When the effect of x is negative and the effect of x^2 is positive, the relationship between x and y has a U-form. In the latter case the formula above yields a minimum point.9

6.4 Non-​linearity 2 (assumption B2): Sets of dummy variables


Non-​linearity between x1 and y can be addressed in various ways, of which
section 6.3 described one. Another approach is the use of sets of dummy
variables. This approach was actually shown in section 3.4, and we repeat and
adjust it here using the newly analyzed soccer player data. Again, the variables
of interest are age and earnings. The idea is to create a dummy for each age group and then compare these age groups' earnings to a reference age group.
(The age distribution of the players ranges from 18 to 39 years; not shown.)
The regression appears in Stata-​output 6.4. Before doing the regression,
I first created the five age group dummies age 21 to 23, age 24 to 26, age 27 to
29, age 30 to 32 and age 33 or more.10 I did not create the dummy for age 18
to 20, which thus serves as the reference group (or reference category).
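As a minimal sketch of how such dummies might be created in Stata (an illustration, not necessarily the exact commands used; the logical expressions evaluate to 1 when true and 0 otherwise, and assume no missing values on age):

. gen age21_23 = (age >= 21 & age <= 23)
. gen age24_26 = (age >= 24 & age <= 26)
. gen age27_29 = (age >= 27 & age <= 29)
. gen age30_32 = (age >= 30 & age <= 32)
. gen age33ormore = (age >= 33)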

Stata-​output 6.4 Regression of earnings (earnings_​E) on age group dummies.

. reg earnings_E age21_23 age24_26 age27_29 age30_32 age33ormore

Source | SS df MS Number of obs = 242


-------------+---------------------------------- F(5, 236) = 6.30
Model | 2.5654e+11 5 5.1308e+10 Prob > F = 0.0000
Residual | 1.9225e+12 236 8.1461e+09 R-squared = 0.1177
-------------+---------------------------------- Adj R-squared = 0.0990
Total | 2.1790e+12 241 9.0416e+09 Root MSE = 90256

------------------------------------------------------------------------------
earnings_E | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age21_23 | 46350.58 22669.46 2.04 0.042 1690.229 91010.92
age24_26 | 86192.5 23134.42 3.73 0.000 40616.15 131768.8
age27_29 | 99667.94 23134.42 4.31 0.000 54091.59 145244.3
age30_32 | 102007.4 23223.17 4.39 0.000 56256.19 147758.6
age33ormore | 99896.96 24367.35 4.10 0.000 51891.65 147902.3
_cons | 40859.49 18819.6 2.17 0.031 3783.62 77935.36
------------------------------------------------------------------------------

The coefficient for age21_​23 says that players aged from 21 to 23 years
earn about 46 000 € more on average than players aged between 18 and 20 (the reference). The latter reference group earns about 41 000 € on a
yearly basis (the constant). The remaining dummy coefficients follow the
same pattern in terms of interpretation. For example, players aged from
30 to 32 years earn 102 000 € more than the reference category on average.
The non-​linearity in the earnings-​age association is present in that the three
oldest age groups (27–​29, 30–​32 and 33 or more) have rather equal annual
earnings. That is, the earnings do not increase steadily with aging as implied
by linearity.
In summary, both the quadratic approach and the sets of dummy variables approach may be used to handle non-linearity between x1 and y. The latter procedure in
general requires more observations, though. A third approach for addressing
non-​linearity, coming up now, is the use of logarithms.

6.5 Non-​linearity 3 (assumption B2): Use of logarithms


Let’s say we want to find out how the price of TVs is related to their size
in inches. Given access to data containing these variables, it is tempting to
regress price on inches in the usual manner shown so far in this book. And
this is exactly what is done in Stata-​output 6.5 for 121 TVs in the Norwegian
market. (The name of the data file is tv_​price.)



Stata-​output 6.5 Regression of TV price in NOK (price) on TV size in inches
(inches).

. use "C:\Users\700245\Desktop\temp_myd\tv_price.dta", clear

. reg price inches

Source | SS df MS Number of obs = 121


-------------+---------------------------------- F(1, 119) = 261.56
Model | 4.9239e+09 1 4.9239e+09 Prob > F = 0.0000
Residual | 2.2402e+09 119 18825282.1 R-squared = 0.6873
-------------+---------------------------------- Adj R-squared = 0.6847
Total | 7.1641e+09 120 59700977.7 Root MSE = 4338.8

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inches | 526.8323 32.57528 16.17 0.000 462.33 591.3346
_cons | -14299.58 1468.823 -9.74 0.000 -17208 -11391.17
------------------------------------------------------------------------------

Figure 6.5 TV price by TV size in inches, with scatterplot smoother. 121 TVs in the
Norwegian market.
Graph generated in Stata.

Since we are comparing different TVs, the regression coefficient suggests that a 51-inch TV costs 527 NOK more on average than a 50-inch TV. The
independent variable explains 69% of the variation (variance) in TV prices.
But is this relationship in fact linear? The plot in Figure 6.5 examines this
using the scatterplot smoother introduced in section 5.3.



The answer is no. The relationship between the two variables is non-​linear.
But since the regression line increases in an accelerating way and has its min-
imum at the lowest x-​value (i.e. x = 18.5), the quadratic regression is not
optimal. In contrast, the (sets of) dummy variable approach might be used if
there are enough observations in the data. However, there is a more elegant
way to proceed: to use logarithms.
Using logarithms in regression analysis means using a transformation of
y or x1 (or both) instead of the original variables.11 Taking the natural loga-
rithm of a number is to crunch it through a formula to become a new number.
In this way, the logarithm of 1 becomes zero and the logarithm of 5 becomes
1.609. Similarly, the logarithm of 100 is 4.605 and the logarithm of 10 000 is
9.210. In other words, when large numbers are expressed on the logarithmic
scale they are much closer in magnitude to small numbers than when stated on
their original scale.12 This trite fact has some nice consequences for regression,
especially regarding how to solve non-​linearity. (Using logarithms goes by
many names, such as “logging a variable”, “using a log” or simply “logging”.
And when using the logarithm of y in a regression, we generally write log y or
ln y.) Stata-​output 6.6 shows the regression between the log of TV prices and
TV size in inches. Note that the logging procedure must precede the regression
command ("gen log_price = ln(price)").
The only thing of note in the output is that the regression coefficient has
changed in magnitude. It is now ≈ 0.06 and implies that a 51-​inch TV costs 6%
more on average than a 50-​inch TV. This is also the general idea: When y is
expressed in logs, the regression coefficient refers to change in y in percent by
a one-​unit increase in x, i.e. b has a relative interpretation.

Stata-output 6.6 Regression of log TV price (log_price) on TV size in inches (inches).

. gen log_price = ln(price)

. reg log_price inches

Source | SS df MS Number of obs = 121


-------------+---------------------------------- F(1, 119) = 539.23
Model | 58.6466019 1 58.6466019 Prob > F = 0.0000
Residual | 12.942466 119 .108760218 R-squared = 0.8192
-------------+---------------------------------- Adj R-squared = 0.8177
Total | 71.5890679 120 .596575566 Root MSE = .32979

------------------------------------------------------------------------------
log_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inches | .0574962 .002476 23.22 0.000 .0525934 .0623989
_cons | 6.256483 .1116436 56.04 0.000 6.035417 6.477548
------------------------------------------------------------------------------

Back to the problem. Does this solve the non-​linearity in Figure 6.5?
Figure 6.6 shows the analogous scatterplot smoother for the log of TV price
and TV size. It shows a fairly linear relationship between the two variables.
Logging y in regressions like this thus often amounts to “linearizing” a non-​
linear relationship between x1 and y.13 When should we use the log of y instead

of y in a regression? As before, theory, prior research or common sense guide us on the real-life functional relationship between x1 and y. When this on occasion provides us with few pointers, however, checking the matter empirically – as done above – should be a straightforward (and mandatory?) enterprise. Generally, logging y is preferable if the real-life functional relationship between x1 and y is of one of the forms depicted in Figure 6.7.

Figure 6.6 TV price (logged) by TV size in inches, with scatterplot smoother. 121 TVs in the Norwegian market. Graph generated in Stata.
The use of dummies when y is logged. In Stata-​output 6.7 the variable smart
(short for smart TV or not; yes = 1, no = 0) is added to the regression in Stata-​
output 6.6. Using the interpretation above we are tempted to say that smart
TVs cost 49% more than non-​smart TVs ceteris paribus. However, this inter-
pretation only holds when b is smaller than roughly ± 0.20. For a larger b, we
need a formula to get to the correct percentage interpretation (see Halvorsen
and Palmquist, 1980), namely

100 × (e^b – 1).

This yields 100 × (2.71828^0.494 – 1) = 63.9% in our case. Smart TVs cost 64% more than non-smart TVs, controlling for TV size. (e^0.494 equals taking the antilogarithm of 0.494.)
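The same calculation can be done directly from the stored coefficient right after the regression in Stata-output 6.7 (a minimal sketch):

. reg log_price inches smart
. display 100*(exp(_b[smart]) - 1)    // exact percentage difference for smart TVs, about 64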


Figure 6.7 Dashed line implies b1 > 0 if y is logged, and solid line implies b1 < 0 if y
is logged.

Stata-output 6.7 Regression of log TV price (log_price) on TV size in inches (inches) and the dummy smart TV (smart).

. reg log_price inches smart

Source | SS df MS Number of obs = 121


-------------+---------------------------------- F(2, 118) = 518.15
Model | 64.2707866 2 32.1353933 Prob > F = 0.0000
Residual | 7.31828133 118 .062019333 R-squared = 0.8978
-------------+---------------------------------- Adj R-squared = 0.8960
Total | 71.5890679 120 .596575566 Root MSE = .24904

------------------------------------------------------------------------------
log_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inches | .0480655 .0021158 22.72 0.000 .0438756 .0522554
smart | .4940654 .0518822 9.52 0.000 .3913245 .5968062
_cons | 6.38027 .085303 74.80 0.000 6.211347 6.549193
------------------------------------------------------------------------------

The last two regressions are called semi-logarithmic regressions or semi-elasticity models. Their "cousins", where x1 (and x2 and so on) and y are
logged, are called double-​logarithmic models. The only thing changing for
the latter models is the interpretation of the regression coefficient, which is
change in y in percent given a one-​percent increase in x.14 The last thing to
note before proceeding with logarithmic regression modeling is that logging
is only possible for strictly positive values. That is, the logarithm of 0 is not
defined.15


6.6 Mediation models


Except for in section 6.2, we have so far estimated multiple regressions of the
kind depicted in Figure 6.8. Here, the effects of x1 and x2 on y, as expressed
by b1 and b2 (i.e. the arrows pointing towards y), are controlled for each other
and share a correlation (i.e. the bidirectional arrow). That is, the effect of x1,
which often is of main interest, may be thought of as x1’s unique effect on y.
And similarly for x2. But what if the real-​life association between the three
variables looks more like the stylized model depicted in Figure 6.9?
Figure 6.9 suggests that x1 has a direct effect on y (solid arrow) as well as
an indirect effect on y via x2 (dashed and dotted arrow). If we were to estimate
this model by the usual multiple regression, however, the direct effect would
be the only effect picked up by the regression. That is, since the regression
would not pick up the indirect effect, it would underestimate the total effect of
x1 on y. (Total effect = direct effect + indirect effect.)
Figure 6.8 Traditional multiple regression model where the effects of the two independent variables are controlled for each other. The bidirectional arrow symbolizes the fact that the two independent variables share a correlation.

Figure 6.9 Path model where x1 affects y directly (solid arrow) and indirectly through x2 (dashed arrow and then dotted arrow).

Figure 6.10 Path model where parents' exercise interest affects students' hours of exercise directly (solid arrow) and indirectly through students' exercise interest (dashed and then dotted arrow).

To examine (path) models like in Figure 6.9, we need to estimate a set of different regression models. The following example illustrates for the
simplest type of path model, i.e. the model in Figure 6.9. Variable x1 is
parents’ interest in exercising (par_​exe_​int), x2 is students’ interest in
exercising (exercise_​int) and y is students’ hours of exercise per week
(phys_​act_​h).16 The data are fem_​stud_​sport.dta. The two inde-
pendent variables are indexes ranging from 0 (minimum interest) to 6 (max-
imum interest). (See section 2.5 if you have forgotten about indexes.) The
path model is presented in Figure 6.10.
Three regressions must be carried out to capture the (a) total effect, (b) direct
effect and (c) indirect effect of parents’ exercise interest in Figure 6.10:

(1) Students' hours of exercise on parents' exercise interest only
(2) Students' hours of exercise on parents' exercise interest and students' exercise interest
(3) Students' exercise interest on parents' exercise interest (dashed arrow)

Regression (1) estimates the total effect, and regression (2) estimates the
direct effect. The indirect effect (c) is found by multiplying the regression
coefficient for students' exercise interest in (2) (dotted arrow) by the regression
coefficient in (3) (dashed arrow). The results of the regressions appear in
Stata-​output 6.8.
The top regression (1) suggests that the total effect is 0.465. The middle
regression (2) suggests a direct effect of 0.201. Finally, the indirect effect is
0.264 (1.327 × 0.199) based on the middle (2) and bottom (3) regression.
(Note again that total effect = direct effect + indirect effect.)
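As a quick check, the pieces do add up. Using Stata's display command and the coefficients reported in Stata-output 6.8 (a sketch of the arithmetic only):

. display .2007574 + 1.32725*.1992438    // direct + indirect effect, about 0.465 = the total effect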
The upshot? Were we to examine the effect of parents’ exercise interest
on students’ hours of exercise by standard multiple regression, we would
suggest an effect of 0.201 (i.e. the direct effect). That is, we would under-
estimate the total effect of this variable, which is 0.465 (i.e. direct + indirect
effect).17



Stata-​output 6.8 Three sets of regressions to estimate the path model in Figure 6.10.

. use "C:\Users\700245\Desktop\temp_myd\fem_stud_sport.dta"

. reg phys_act_h par_exe_int

Source | SS df MS Number of obs = 293


-------------+---------------------------------- F(1, 291) = 11.54
Model | 122.394756 1 122.394756 Prob > F = 0.0008
Residual | 3085.52248 291 10.60317 R-squared = 0.0382
-------------+---------------------------------- Adj R-squared = 0.0348
Total | 3207.91724 292 10.9860179 Root MSE = 3.2563

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .4652037 .136924 3.40 0.001 .1957168 .7346907
_cons | 1.505246 .4920277 3.06 0.002 .5368623 2.47363
------------------------------------------------------------------------------

. reg phys_act_h par_exe_int exercise_int

Source | SS df MS Number of obs = 293


-------------+---------------------------------- F(2, 290) = 62.01
Model | 960.895903 2 480.447952 Prob > F = 0.0000
Residual | 2247.02133 290 7.74834942 R-squared = 0.2995
-------------+---------------------------------- Adj R-squared = 0.2947
Total | 3207.91724 292 10.9860179 Root MSE = 2.7836

------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .2007574 .1197773 1.68 0.095 -.0349856 .4365005
exercise_int | 1.32725 .1275867 10.40 0.000 1.076137 1.578363
_cons | -3.634039 .6488283 -5.60 0.000 -4.911048 -2.357029
------------------------------------------------------------------------------

. reg exercise_int par_exe_int

Source | SS df MS Number of obs = 293


-------------+---------------------------------- F(1, 291) = 13.73
Model | 22.4515335 1 22.4515335 Prob > F = 0.0003
Residual | 475.990446 291 1.635706 R-squared = 0.0450
-------------+---------------------------------- Adj R-squared = 0.0418
Total | 498.44198 292 1.70699308 Root MSE = 1.2789

------------------------------------------------------------------------------
exercise_int | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .1992438 .0537792 3.70 0.000 .0933983 .3050893
_cons | 3.872131 .1932521 20.04 0.000 3.491782 4.25248
------------------------------------------------------------------------------

In many applications the x2 variable – here: students' exercise interest – is called a mediator. Another term for it is a mechanism, i.e. a variable that (at least
partially) explains the statistical relationship between x1 and y. Here, students’
exercise interest (x2) explains 57% of the association between parents’ exer-
cise interest (x1) and students’ hours of exercise, y ((0.4652–​0.2007)/​0.4652 ≈
0.569). In this sense, x2 is a partial mediator.
When do we employ a path (mediator) model instead of the usual multiple
regression model? The answer is when there is a sounder theoretical justifica-
tion for the former in terms of what happens in real life. The purpose of this



section has been to provide the reader with a first take on path models using
a simple example. However, the principle shown holds for models including
several controls as well as several potential mediators. Of course, in these
expanded models, the mathematical and statistical difficulties increase, but to
accommodate those we have statistics programs at our disposal.
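For readers who want to estimate such expanded path models directly, Stata's built-in sem command is one option. The lines below are only a sketch: they assume that controls named age and gender exist in the data (cf. note 16), and estat teffects then reports the direct, indirect and total effects in one table.

sem (exercise_int <- par_exe_int age gender) ///
    (phys_act_h <- exercise_int par_exe_int age gender)
estat teffects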
In certain textbooks, especially those going by the name of econometrics,
the x2 or mediator is called a “bad control variable”.18 A bad control is a vari-
able that lies between the independent variable of interest, x1, and y in the
causal chain or process going from left to right in the above figures. That is,
including a bad control in a regression will underestimate the total effect of x1
on y. For this reason, the advice goes, bad controls must be discarded from
the regression. Another and perhaps more constructive approach would be to
use a path (mediator) model instead.

6.7 Summary and further reading


Chapter 6 first dealt with the regression assumption of additivity, which says
that the coefficients in a multiple regression are of the same size for every sub-
group of observations. When this assumption does not sit well with reality, we
instead opt for a regression that includes one or more interaction terms. (This
is often called a moderator or multiplicative regression model.) When this
interaction term’s regression coefficient is statistically significant (suggesting a
violation of additivity), we prefer the interaction regression model. When the
interaction term’s regression coefficient is insignificant, in contrast, we stick
to the simpler, additive regression model based on the principle of parsimony.
The middle of Chapter 6 illustrated three ways in which to handle a non-​
linear relationship between x1 and y, i.e. a violation of the linearity assumption
in linear regression analysis. The three strategies offered are sometimes
equivalent and sometimes not. When there are no signs of non-​linear effects,
we prefer the simple, linear model in keeping with the parsimony principle.
The final part of Chapter 6 introduced mediator or path models. These
models are to be preferred when we have reason to believe that our inde-
pendent variable of interest, x1, also has indirect effects on y working through
other independent variables, such as x2.
Interaction or moderator regression models are covered in most of the
books and papers mentioned in the “further reading” sections so far. Mitchell
(2012) is a very useful and graphically oriented book on the subject using
Stata. Non-​linearity is also covered in every textbook. Baron and Kenny
(1986) is the standard reference to mediator models, and a recent and very
readable review and introduction to the subject is given by Pieters (2017).

Notes
1 In real-​life academic settings (yes, there is such a thing!), several controls would
be included in this regression. I omit these here, and in the remaining chapter



examples, to highlight the procedures of interest. The only complexity introduced
by adding controls is that the regression coefficient of main interest might change
in magnitude as we saw it did in Chapter 3.
2 This procedure is the general one that works for all statistics programs, with small
variations. Stata also has a built-​in command that does these analyses in one
step, as in:
“reg phys_​act_​h i.gender##c.friend_​phys”.
That is, the “##”-​command tells Stata to multiply the two variables and add the
resulting interaction term to the regression.
3 Since men are coded 0 in the data, the value of the product term variable for them
also becomes 0. (Multiplying a number by 0 always yields 0.) Since women are
coded 1 in the data, the value of the product term variable for them becomes –​0.79.
4 The command for generating the graph is:
reg phys_act_h i.gender##c.friend_phys
margins gender, at(friend_phys=(0 4))
marginsplot, noci title(Students’ hours of exercise per week) xtitle(Friends’ exercise level (0 = lowest; 4 = highest)) ytitle(Hours per week)
5 A regression may contain several interaction variables (product terms), something
which makes interpretation harder (but see Mitchell, 2012). Remember always to
keep the original variables in the regression, i.e. x1 and x2 and x1 * x2. (Remember
also that * is the Stata-​convention for the multiply sign, ×.)
6 Interaction terms may also be created between two dummies or two numerical
variables, and they are roughly interpreted along the lines of the one presented
here (for more on this, see Mitchell, 2012). The general idea is that the effect of
x1 on y is conditional on the level of another independent variable, x2. Interaction
models are also called multiplicative models in the literature.
7 I once more omit x2 and x3 and so on for the sake of exposition, but it goes without
saying that non-​linearity issues could (and perhaps often should) be examined for
every independent variable in the regression model, except for dummies.
8 The command for generating the graph is:
gen earnthousands = earnings_E/1000
graph twoway (scatter earnthousands age) (lfit earnthousands age) (qfit earnthousands age), xtitle("Age of player") ytitle("Earnings, thousands of Euro")
9 There are more complex non-linear relationships, such as a cubic one. This requires the use of x, x² and x³ (i.e. x * x * x) in the regression. But in practice the quadratic specification is by far the most commonly applied. Related but more complicated procedures involve piecewise regression models and spline models; see Lohmann (2015) for this.
10 The commands are:
gen age21_23 = age
gen age24_26 = age
gen age27_29 = age
gen age30_32 = age
gen age33ormore = age
recode age21_23 (21/23 = 1) (else = 0)
recode age24_26 (24/26 = 1) (else = 0)
recode age27_29 (27/29 = 1) (else = 0)
recode age30_32 (30/32 = 1) (else = 0)
recode age33ormore (33/39 = 1) (else = 0)

11 Again, x2 and x3 and so on are omitted for the sake of exposition.


12 A user-​friendly introduction to logarithms for non-​mathematicians is given by
Studenmund (2017).
13 Logging y to get a more linear relationship between x1 and y quite often also
solves another issue. Not shown here, the linear regression in Stata-​output 6.5
has severe heteroscedasticity problems. In contrast, the logarithmic regression in
Stata-​output 6.6 has no such problems. See King and Roberts (2015) for more
on this.
14 A note on terminology: Sometimes the linear or plain vanilla model is called the
lin-​lin model, whereas the semi-​logarithmic model is called log-​lin. The double-​
logarithmic one is thus called log-​log. There is also a fourth and less used semi-​
logarithmic variant, i.e. lin-​log. The latter is also called a production function in
economics (see Studenmund, 2017). As always, theory, prior research and common
sense (as well as the idiosyncrasies of the data in question) guide us regarding
which of these functional forms (i.e. type of relationships) to prefer.
15 Dummy independent variables may be logged if they first are coded as 2.71828 (i.e. the number e) and 1. Logging these values then yields 1 and 0.
16 More extensive models include control variables, such as age and gender. They
could also include more variables similar to students’ exercise interest, like
attitudes towards dietary habits or smoking behavior.
17 Whether or not the indirect effect is statistically significant may be of interest. This
requires hard calculations by hand, but fortunately Stata has an add-​on that may
be downloaded from the Internet (write “findit medeff ” in the command window
of Stata, press Enter and follow the instructions). The command (see the help file
for details; “help medeff ”) and results appear below. We note that the indirect
effect, called “ACME”, has a 95% confidence interval ranging from about 0.12 to
0.44, making it (indirectly) statistically significant.



. medeff (regress exercise_int par_exe_int) (regress phys_act_h par_exe_int
exercise_int), mediate(exercise_int) treat( par_exe_int)
Using 0 and 1 as treatment values
Source | SS df MS Number of obs = 293
-------------+---------------------------------- F(1, 291) = 13.73
Model | 22.4515335 1 22.4515335 Prob > F = 0.0003
Residual | 475.990446 291 1.635706 R-squared = 0.0450
-------------+---------------------------------- Adj R-squared = 0.0418
Total | 498.44198 292 1.70699308 Root MSE = 1.2789
------------------------------------------------------------------------------
exercise_int | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .1992438 .0537792 3.70 0.000 .0933983 .3050893
_cons | 3.872131 .1932521 20.04 0.000 3.491782 4.25248
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 293
-------------+---------------------------------- F(2, 290) = 62.01
Model | 960.895903 2 480.447952 Prob > F = 0.0000
Residual | 2247.02133 290 7.74834942 R-squared = 0.2995
-------------+---------------------------------- Adj R-squared = 0.2947
Total | 3207.91724 292 10.9860179 Root MSE = 2.7836
------------------------------------------------------------------------------
phys_act_h | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
par_exe_int | .2007574 .1197773 1.68 0.095 -.0349856 .4365005
exercise_int | 1.32725 .1275867 10.40 0.000 1.076137 1.578363
_cons | -3.634039 .6488283 -5.60 0.000 -4.911048 -2.357029
------------------------------------------------------------------------------
Effect | Mean [95% Conf. Interval]
-------------------------------+-------------------------------------------------
ACME | .2686881 .1188179 .4354832
Direct Effect | .200825 -.0376287 .4361173
Total Effect | .4695132 .1875112 .7484893
% of Tot Eff mediated | .5695042 .3589794 1.432942
---------------------------------------------------------------------------------

18 A more precise, formal name is an endogenous independent variable (more on this


in Chapter 9).


7 A categorical dependent variable
Logistic (logit) regression and related methods

7.1 Introduction
The y-​variables in this book so far have been numerical, i.e. variables with
a long range of numbers as possible values. In such cases, linear regression
along with its extensions presented in Chapter 6 most often works very well,
even in the presence of mild violations of the regression assumptions. It is
thus said that linear regression is a robust method. Quite often, however,
research involves an y-​variable with a few, categorical values. A special case is
the binary variable; what we call a dummy. This chapter begins with analyzing
a dummy y and finishes off with the analysis of a y with three unordered
values/​categories (see section 7.6).
Dummy dependent variables flourish in the social and economic sciences.
Here are some examples:

• Some people choose to vote in elections, whereas others don’t. Political


scientists are interested in how independent variables such as age, gender
and income affect this voting choice (yes = 1, no = 0).
• Some firms go bankrupt within a few years of startup, whereas others
don’t. Economists are interested in how independent variables such as
type of funding, type of organizational structure and gender of CEO
affect going bankrupt or not (yes = 1, no = 0).
• Some are employed full time, whereas others aren’t. Sociologists are
interested in how independent variables like education, working experi-
ence and family obligations affect this employment “choice” (yes = 1,
no = 0).1
• Some try to commit suicide, whereas most others (thankfully) don’t.
Psychologists are interested in how independent variables like type of
upbringing, personality traits and self-​esteem affect this “choice” (yes = 1,
no = 0).

And the list goes on… In short, we are talking about dependent variables that
might be thought of as the presence (1) or absence (0) of something. Or,
in the framework of questionnaires, a question that directly or indirectly may
be answered with yes (1) or no (0).2



It has been customary for years to use logistic (often called logit) regression
for analyzing a binary y. But of late it has become more popular to use the
plain vanilla linear regression model even when y is binary. For this reason,
this chapter starts with the so-​called linear probability model (LPM), i.e. the
regression model we have seen so far on a binary y (section 7.2). Section 7.3
then uses the standard logistic (logit) regression model on the same example
data as in section 7.2. Section 7.4 takes up the issue of interpreting and
presenting the results of the logistic regression model, which is no small feat.
Interaction effects, non-​linearity and model fit are discussed in section 7.5.
Section 7.6 presents a regression model for an extension in which y is a choice
between three alternatives, and section 7.7 addresses some further issues and
points to further reading.

7.2 The linear probability model (LPM)


Scenario. A new fitness center is under construction in the vicinity of a univer-
sity college. A random sample of students is asked in a questionnaire how they would prefer to pay for exercising in this center. The options are (i)
a fixed monthly fee with no restrictions on number of times exercising or (ii)
paying a small fee per time for each exercise. (Both alternatives included cost information and were set up in such a way that exercising two times per week amounted to the same monthly cost.) The frequency distribution and
descriptive statistics for this variable (fixed_pm; yes = 1, no = 0) appear in Stata-output 7.1. The data are called pay_pref_fit.

Stata-output 7.1 Descriptive statistics for the dummy variable payment preference (fixed_pm).

. use "C:\Users\700245\Desktop\temp_myd\pay_pref_fit.dta", clear

. tab fixed_pm

fixed_pm | Freq. Percent Cum.


------------+-----------------------------------
no | 181 27.38 27.38
yes | 480 72.62 100.00
------------+-----------------------------------
Total | 661 100.00

. sum fixed_pm

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
fixed_pm | 661 .7261725 .4462592 0 1

In the upper part of the output we see that about 73% of the students
prefer to pay a fixed amount per month. Conversely, 27% prefer paying per
time. The lower part tells us the same thing, but expressed as a probability or
chance. The probability (chance) that a random student prefers a fixed price



payment structure is about 0.73. Note that the mean of a 0/1 variable is always the proportion of observations with the value 1 on that variable.
Research question. The research question involves how two x-​variables affect
this dummy y, i.e. how one option is preferred to the other. These x-​variables
are weekly hours of exercise (phys_act_h) and whether or not the students presently are fitness center members (fitness_c; yes = 1, no = 0). We want to
know, without formalizing any hypotheses, how more exercising per week and
being a fitness center member (as opposed to not) affect the probability of
preferring a fixed payment structure. The linear regression/​linear probability
model (LPM) in Stata-​output 7.2 sheds light on this.

Stata-output 7.2 Linear regression/LPM of payment preference (fixed_pm) on weekly hours of exercise (phys_act_h) and fitness center membership (fitness_c).

. reg fixed_pm phys_act_h i.fitness_c

Source | SS df MS Number of obs = 661


-------------+---------------------------------- F(2, 658) = 40.98
Model | 14.5599182 2 7.27995912 Prob > F = 0.0000
Residual | 116.877298 658 .177625073 R-squared = 0.1108
-------------+---------------------------------- Adj R-squared = 0.1081
Total | 131.437216 660 .199147297 Root MSE = .42146

------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .0268293 .0055132 4.87 0.000 .0160037 .0376548
|
fitness_c |
yes | .1863895 .0353186 5.28 0.000 .1170388 .2557403
_cons | .5078247 .0300811 16.88 0.000 .4487583 .5668912
------------------------------------------------------------------------------

The exercise effect suggests that exercising for, say, six hours per week
entails a 2.7 percentage points larger probability of preferring a fixed fee than
exercising for five hours. Similarly, present fitness center members have an
18.6 percentage points larger probability of preferring a fixed fee than non-​
members. Both effects are significant and controlled for each other.
The LPM has intuitive appeal, as it requires nothing we haven’t already
learned. But it also has some potential problems. The first is that the data
points will not lie close to the linear regression line (i.e. poor model fit).
The second is that predictions from the LPM might be smaller than 0 or
larger than 1, something which is impossible for a proportion. That is,
the probability of something occurring (like preferring fixed pay) must
always be in the 0–​1 region, just as a proportion is always between 0 and
100%. The third and minor problem is that the error term, e, always will be
heteroscedastic and non-​normally distributed. We’ll get back to whether
or not these issues are serious enough to abandon the LPM altogether in
section 7.4.
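The second of these problems is easy to see with a back-of-the-envelope calculation based on Stata-output 7.2. For a current fitness center member exercising 15 hours per week (an illustrative value), the LPM predicts a “probability” above 1:

display .5078247 + .0268293*15 + .1863895   // about 1.10, which is impossible for a probability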


7.3 The standard multiple logistic (logit) regression model


Before going into the technicalities of logistic regression analysis, Stata-​output
7.3 first presents the logistic (logit) regression command and the results for the
data analyzed in Stata-​output 7.2.
Stata-output 7.3 Logistic regression of payment preference (fixed_pm) on weekly hours of exercise (phys_act_h) and fitness center membership (fitness_c).

. logit fixed_pm phys_act_h i.fitness_c

Iteration 0: log likelihood = -388.026


Iteration 1: log likelihood = -349.56718
Iteration 2: log likelihood = -347.82746
Iteration 3: log likelihood = -347.82093
Iteration 4: log likelihood = -347.82093

Logistic regression Number of obs = 661


LR chi2(2) = 80.41
Prob > chi2 = 0.0000
Log likelihood = -347.82093 Pseudo R2 = 0.1036

------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1817575 .037516 4.84 0.000 .1082274 .2552875
|
fitness_c |
yes | .9830869 .2037055 4.83 0.000 .5838314 1.382342
_cons | -.1938905 .1652578 -1.17 0.241 -.5177898 .1300088
------------------------------------------------------------------------------

Two features are of note in the output. First, both independent variables
have significant effects on the pay preference dummy; z-​values correspond
roughly to t-​values.3 Second, the signs of the two logistic regression coefficients
(+) tell us that students exercising a lot have a larger probability of preferring
a fixed fee than those exercising little, and that current fitness center members
have a larger probability of preferring a fixed fee than non-​members. Both
findings are in keeping with the results of the LPM in Stata-​output 7.2. But
other than this, the logistic coefficients unfortunately do not tell us much
about the magnitude of these effects –​as opposed to the similar effects in the
LPM. Why? The answer is to be found in the upcoming mathematical proper-
ties of the logistic (logit) regression model.
The statistical model actually estimated in logistic (logit) regression is

Ln(p / (1 − p)) = b0 + b1x1 + b2x2 + … + bnxn,

where Ln means “the logarithm of ”, and p is the probability of y = 1 (here: pay


preference = fixed fee). That is, the left-​hand side of the equation is a non-​linear
function of the independent variables. This entails that the logistic regression


Figure 7.1 A sigmoid relationship between x1 and y.

line is not linear but sigmoid/​S-​shaped, as illustrated in Figure 7.1 in the case
with one independent variable, x1. The non-​linear transformation thus solves
the model fit problem (partly) and the prediction problem, since predictions are now confined to the 0–1 region. But this rescue operation comes at a price we have seen: non-informative coefficients, i.e. coefficients that do not have the “if x goes up by one unit, y changes by …” interpretation. On the contrary, the figure shows that the effect of x1 – i.e. “if x1 goes up by one unit, y changes by …” – is not constant, but contingent on where on the x-axis the increase in x1 takes place.
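To state the same point a bit more formally (a standard result, not something specific to this example): with one independent variable the model implies

p = e^(b0 + b1x1) / (1 + e^(b0 + b1x1)),

and the marginal effect of x1 on p is b1 × p × (1 − p). This effect is largest where p is close to 0.5 (the middle of the S-curve) and approaches zero as p approaches 0 or 1.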

7.4 Interpreting and presenting the results of the logistic (logit) regression model
The non-​linearity embedded in the mathematical properties of the logistic
regression model is the reason why logistic regression coefficients have no
translucent interpretation. And for this reason we have to make them interpret-
able to ourselves and to our readers. Several possibilities exist in this regard,
and I present two of them: marginal effects and predicted probabilities.4
Marginal effects. The marginal effect approach, to simplify, focuses on the
area in Figure 7.1 where the S-​shaped logistic regression line is very close to
its linear counterpart, i.e. in the middle region of the x-​axis. In this region
the marginal effect has the linear-​approximation interpretation “change in
y in percentage points given a one-​unit increase in x1”. After the logistic
regression command in Stata-​output 7.3, one writes the command asking for



marginal effects. Stata-​output 7.4 repeats the logit regression in Stata-​output
7.3 and shows the command for obtaining the marginal effects (“margins,
dydx(*)”) and the results.

Stata-output 7.4 Logistic regression of payment preference (fixed_pm) on weekly hours of exercise (phys_act_h) and fitness center membership (fitness_c), with marginal effects.

. logit fixed_pm phys_act_h i.fitness_c

Output omitted …

Logistic regression Number of obs = 661


LR chi2(2) = 80.41
Prob > chi2 = 0.0000
Log likelihood = -347.82093 Pseudo R2 = 0.1036

------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1817575 .037516 4.84 0.000 .1082274 .2552875
|
fitness_c |
yes | .9830869 .2037055 4.83 0.000 .5838314 1.382342
_cons | -.1938905 .1652578 -1.17 0.241 -.5177898 .1300088
------------------------------------------------------------------------------

. margins, dydx(*)

Average marginal effects Number of obs = 661


Model VCE : OIM

Expression : Pr(fixed_pm), predict()


dy/dx w.r.t. : phys_act_h 1.fitness_c

------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .031917 .0062164 5.13 0.000 .0197331 .0441009
|
fitness_c |
yes | .1772555 .0357648 4.96 0.000 .1071577 .2473533
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

The marginal effects appear in the “dy/​dx” column.5 Students exercising


for, say, six hours per week have a 3.2 percentage points larger probability
of preferring fixed pay than students exercising for five hours a week. (Two
more hours = 6.4 percentage points and so on.) Also, present fitness center
members have a 17.7 percentage points larger probability of preferring fixed
pay than non-​members. The “similar” marginal effects for the LPM were 2.7
and 18.6 (see Stata-​output 7.2). Both marginal effects are controlled for each
other as in the multiple linear regression case.6
The linear probability model (LPM) versus logistic regression. Section 7.2
mentioned that the linear regression/​the LPM was not ideal for examining a
binary y. In the present case, however, the results of the LPM and the logistic



regression were qualitatively similar. This also seems to be the general
case. For this reason, it has become customary, especially among economists
(see Angrist and Pischke, 2009), to actually prefer the LPM when y is binary.
My personal take is also that the very translucent linear regression model/​
LPM works fine in most cases, especially when the proportion of y = 1 lies in
the 20 to 80% region.
The marginal effect approach’s merit is ease of interpretation and com-
munication; it is a single and transparent figure. On the downside, the base
probability –​that 73% in our case prefer the fixed pay form –​often gets lost
along the way. This problem is solved in the predicted probabilities approach
coming up next.
Predicted probabilities. The name of this approach says it all. Plug different
values for the x-​variables into the logistic regression equation and calculate
the predicted probability of y = 1. In this regard we should use theoretic-
ally relevant and/​or representative x-​values to construct probabilities. What
is the predicted probability of preferring fixed pay for a student exercising
four hours per week who at present is not a fitness center member? We get the
answer by plugging the respective values (i.e. 4 and 0) into the equation (see
Menard, 1995, p. 13)

p = e^(b0 + b1x1 + b2x2) / (1 + e^(b0 + b1x1 + b2x2)).

This yields (see the constant and the logistic regression coefficients in Stata-​
outputs 7.3 or 7.4)

e^(−0.1939 + 0.1818(4) + 0.9831(0)) / (1 + e^(−0.1939 + 0.1818(4) + 0.9831(0))) = e^0.5333 / (1 + e^0.5333) = 1.7045 / 2.7045 = 0.6302 ≈ 63%.7

A student who presently is not a member of a fitness center and who exercises for four hours per week has a 63% probability of preferring fixed pay. What if the “same” student is a fitness center member? By replacing 0 with 1 in the equation above (i.e. 0.9831 × 1), we get ≈ 82%. The otherwise similar student has an 82% probability of preferring a fixed pay form, i.e. a 19 percentage points larger probability. To get the predicted probability for students exercising very much, say 12 hours per week, we replace 4 with 12 in the equation and so on.
To do the calculations above you don’t need a statistics program. But
Excel might help. In Stata these calculations may be done semi-​automatically.
Stata-​output 7.5 illustrates for four hours of exercise and not being a fitness
center member. (Immediately after the logistic (logit) regression you write the
command “margins, at(phys_​ act_​h=4 fitness_​ c=0)”.)



Stata-​output 7.5 Predicted probability of y = 1 for selected values of the independent
variables.

. margins, at(phys_act_h=4 fitness_c=0)

Adjusted predictions Number of obs = 661


Model VCE : OIM

Expression : Pr(fixed_pm), predict()


at : phys_act_h = 4
fitness_c = 0

------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | .630215 .0277869 22.68 0.000 .5757537 .6846763
------------------------------------------------------------------------------

As we see, Stata yields the same result (i.e. 0.6302) as did our hand
calculation above.
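As a side note, the hand calculation may also be replicated with Stata's built-in invlogit() function, which does the back-transformation in one step; the two lines below (using the coefficients from Stata-output 7.3) reproduce the 63% and 82% figures.

display invlogit(-.1938905 + .1817575*4)              // non-member, four hours: about 0.63
display invlogit(-.1938905 + .1817575*4 + .9830869)   // member, four hours: about 0.82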
In most practical applications of logistic regression there are several inde-
pendent variables and not only two. This makes computing predicted probabilities by hand more cumbersome, but the principle shown above still
applies. Also, since only one or two independent variables are of main interest
in most settings, control variables may often be set to their mean/​median
values. A possible exception is dummy controls, where means often are mean-
ingless.8 Predicted probabilities for different values of a numerical or ordinal
x-​variable may also be put into an illustrative graph.
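One way to produce such a graph in Stata is sketched below; the range 0 to 18 in the at() option is just an assumption about reasonable values of phys_act_h and should be adapted to the data at hand.

logit fixed_pm phys_act_h i.fitness_c
margins fitness_c, at(phys_act_h=(0(2)18))
marginsplot, ytitle(Probability of preferring fixed pay)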

7.5 Interaction models, non-​linearity and model fit


Interaction models. Interaction (moderator) logistic regression models are
estimated and interpreted in the same manner as in the linear regression
case (see section 6.2). That is, one adds an interaction term to the additive
model. If this term has a significant effect, the interaction/​multiplicative
model is preferred to the additive model. In contrast, if the interaction
term is insignificant, the additive model is kept. We continue with the pay
preference case from sections 7.2/​7.3, but the independent variables are
now hours of exercise and gender (1 = women, 0 = men). The product
of these two independent variables is first generated and then added as a
third independent variable to the logistic regression model; Stata-​output
7.6 illustrates.



Stata-output 7.6 Logistic regression of payment preference (fixed_pm) on weekly hours of exercise (phys_act_h) and gender (gender) and their product variable (womphys).

. gen womphys = phys_act_h * gender

. logit fixed_pm phys_act_h gender womphys

Iteration 0: log likelihood = -388.026


Iteration 1: log likelihood = -358.08102
Iteration 2: log likelihood = -356.88998
Iteration 3: log likelihood = -356.88651
Iteration 4: log likelihood = -356.88651

Logistic regression Number of obs = 661


LR chi2(3) = 62.28
Prob > chi2 = 0.0000
Log likelihood = -356.88651 Pseudo R2 = 0.0803

------------------------------------------------------------------------------
fixed_pm | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1783637 .0496193 3.59 0.000 .0811116 .2756157
gender | -.2649659 .345209 -0.77 0.443 -.9415631 .4116312
womphys | .1431828 .0721017 1.99 0.047 .001866 .2844995
_cons | .0666088 .2790297 0.24 0.811 -.4802794 .6134969
------------------------------------------------------------------------------

Since the logistic regression coefficients have no intuitive metric, the main
concern is the sign and significance level of the interaction term. The fact
that it is significant (p < 0.05) tentatively suggests that the effect of hours
of exercise on preference for a fixed pay structure is different for men and
women.9 The sign of the interaction term womphys (+) indicates that the
exercise effect is stronger (0.321; 0.178 + 0.143 = 0.321) for women than for
men (0.178). A graph aids interpretation; see Figure 7.2. Note the somewhat
steeper regression line for women (black line). Whether or not this gender
difference is large enough to warrant attention is another matter, however.
Note also that the two non-​linear regression lines are a direct consequence
of the inherent non-​linearity in the logistic regression model. This does
not imply that these relationships are non-​linear in real life. To examine
the latter, we must address non-​linearity explicitly. This is the next topic
coming up.
Non-​linearity. In the same vein as for linear regression models, the relation-
ship between x1 and a dummy y may be non-​linear in real life. Two of three
procedures illustrated in Chapter 6 are relevant in the logistic case: quad-
ratic models and sets of dummy variables. (The use of logarithms is not
viable, since this feature is already embedded in the logistic model.) The


Figure 7.2 Visual illustration of interaction effect between gender and hours of exercise per week (graph title: “Pref. for fixed pay form by exercise hours and gender”; y-axis: probability; x-axis: hours of exercise per week, 0–19; separate lines for male and female students), based on the logistic (logit) regression model reported in Stata-output 7.6.

procedures are exactly the same in the linear and the logistic case, but
one should in the latter case also transform the logistic coefficients into
marginal effects and/​or predicted probabilities or graphs to get them
interpretable.
Model fit. Model fit could be relevant for logistic regression models, but not so
much as in the linear regression case. The reason is that there is no universal
measure of model fit, such as R2. In contrast, there are several pseudo R2
measures available. The default in Stata is McFadden’s pseudo R2; see Stata-​
output 7.6. This roughly suggests that the independent variables explain 8%
of the “variation” in pay preference form. Using the command “fitstat”
immediately after the logistic regression model yields a suite of different
pseudo R2 measures, as illustrated in Stata-​output 7.7. The pseudo R2 values
range from 3 to 17% in the present case.



Stata-​output 7.7 Different measures of model fit (pseudo R2 measures) for the logistic
regression model in Stata-​output 7.6.

. fitstat

| logit
-------------------------+-------------
Log-likelihood |
Model | -356.887
Intercept-only | -388.026
-------------------------+-------------
Chi-square |
Deviance(df=657) | 713.773
LR(df=3) | 62.279
p-value | 0.000
-------------------------+-------------
R2 |
McFadden | 0.080
McFadden(adjusted) | 0.070
McKelvey & Zavoina | 0.170
Cox-Snell/ML | 0.090
Cragg-Uhler/Nagelkerke | 0.130
Efron | 0.095
Tjur's D | 0.093
Count | 0.735
Count(adjusted) | 0.033
-------------------------+-------------
IC |
AIC | 721.773
AIC divided by N | 1.092
BIC(df=4) | 739.748
-------------------------+-------------
Variance of |
e | 3.290
y-star | 3.964

7.6 Extension: When y has three or more unordered alternatives (multinomial logistic regression)
Sometimes choice variables have more than two unordered alternatives or
options, such as choosing between three presidential candidates or between
walking, cycling or going by car to work/​campus. In the same fashion as
above, independent variables may affect how one alternative is preferred to
the other two. We continue with the data on payment preference for exercising
in a fitness center, which in sections 7.2 and 7.3 was simplified to two payment
options. In reality there were three: (i) a fixed monthly fee, (ii) pay per time and (iii) a mixed payment scheme, i.e. a smaller fixed fee than in (i) and a larger cost per time than in (ii). In the analyses above, options (ii) and (iii) were
collapsed into one alternative. The distribution for the original payment pref-
erence variable appears in Stata-​output 7.8.



Stata-output 7.8 Descriptive statistics for the variable payment preference (payment_pref).

. tab payment_pref

payment_pref | Freq. Percent Cum.


----------------+-----------------------------------
pay per time | 118 17.85 17.85
mixed | 63 9.53 27.38
fixed per month | 480 72.62 100.00
----------------+-----------------------------------
Total | 661 100.00

As before, 73% of the students prefer the fixed payment scheme. The new
information is how those not preferring fixed pay (i.e. 27%) are divided among
paying per time exclusively or preferring a mixed payment scheme. The former
is most popular with almost 18% as opposed to just under 10%. The question
now is how hours of weekly exercise, present fitness center membership and
gender affect the choice among the three payment options. When a choice
variable has three or more alternatives or options, we prefer the multinomial
logistic regression model and not the “ordinary” logistic.10 The command and
the results are displayed in Stata-​output 7.9.
Interpretation of multinomial logistic models is not easy to begin with.
The trick is to note that one of the options serves as baseline in all coef-
ficient assessments. In the analysis above, this reference (i.e. base out-
come) is the exclusive pay per time option. This means that paying a fixed
fee per month (lower part of the output) or preferring a mixed payment
scheme (upper part of the output) should both be contrasted with the pay
per time option.11 Starting with the lower part of the output, we see that
students exercising a lot have a larger probability of preferring a monthly
fixed fee rather than paying per time than those exercising little (i.e. a posi-
tive and significant coefficient). Similarly, present fitness center members
have a larger probability than non-​members of preferring a monthly fixed
fee rather than paying per time (i.e. a positive and significant coefficient). In
contrast, there is no gender difference in this regard since the gender coeffi-
cient is insignificant. Turning to the upper part of the output, we have that
students exercising a lot are close to having a larger probability of preferring
a mixed pay scheme rather than paying per time than those who exercise
little (i.e. a positive and almost significant coefficient; p = 0.056). We also
note that present fitness center members have a larger probability of pre-
ferring a mixed pay option rather than paying per time than non-​members
(i.e. a positive and significant coefficient). Finally, women are less inclined
than men to prefer a mixed pay scheme as opposed to paying per time (i.e. a
negative and significant coefficient).



Stata-​output 7.9 Multinomial logistic regression of payment preference (payment_​
pref) on weekly hours of exercise (phys_​act_​h), fitness center
membership (fitness_​ c) and gender (gender).

. mlogit payment_pref phys_act_h i.fitness_c i.gender, base(0)

Iteration 0: log likelihood = -504.99568


Iteration 1: log likelihood = -457.70119
Iteration 2: log likelihood = -453.37223
Iteration 3: log likelihood = -453.33405
Iteration 4: log likelihood = -453.33403

Multinomial logistic regression Number of obs = 661


LR chi2(6) = 103.32
Prob > chi2 = 0.0000
Log likelihood = -453.33403 Pseudo R2 = 0.1023

---------------------------------------------------------------------------------
payment_pref | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
pay_per_time | (base outcome)
----------------+----------------------------------------------------------------
mixed |
phys_act_h | .1240418 .0649641 1.91 0.056 -.0032854 .251369
|
fitness_c |
yes | .8417288 .367798 2.29 0.022 .120858 1.5626
|
gender |
female | -.8424532 .3355947 -2.51 0.012 -1.500207 -.1846996
_cons | -.7616118 .3593536 -2.12 0.034 -1.465932 -.0572918
----------------+----------------------------------------------------------------
fixed_per_month |
phys_act_h | .2446246 .0485079 5.04 0.000 .149551 .3396983
|
fitness_c |
yes | 1.314994 .2644774 4.97 0.000 .7966278 1.83336
|
gender |
female | -.1018814 .2497513 -0.41 0.683 -.591385 .3876222
_cons | .0339501 .2729352 0.12 0.901 -.5009931 .5688933
---------------------------------------------------------------------------------

How much larger/​smaller probabilities are we talking about? As for binary


logistic (logit) regression, multinomial logistic regression coefficients have no
translucent metric. In order to shed light on the “how much” questions, there-
fore, the coefficients must be transformed into marginal effects or predicted
probabilities for observations with different values on the x-​variables in the
same way as illustrated in section 7.4.
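In Stata this again works through the margins command after mlogit; the outcome of interest is picked with predict(outcome()). The lines below are a sketch which assumes that the fixed-per-month option is coded 2 in payment_pref (with pay per time coded 0, as implied by base(0)).

mlogit payment_pref phys_act_h i.fitness_c i.gender, base(0)
margins, dydx(*) predict(outcome(2))      // marginal effects on Pr(fixed per month)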
The above analysis had three options/​alternatives for y. The extension to
models with four or more options is straightforward. Interaction effects and
non-​linearity must be examined in the same way as in the binary y case.

7.7 Other related issues and further reading


More can be said about regression models for non-​numerical y-​variables,
and some of it will be covered in Chapter 8. I close this chapter with some
issues that often come up when analyzing binary y-​ variables and some
recommendations for further reading.



Logit versus probit models. In certain disciplines, especially economics, a probit
model is more in vogue than the logit model considered here. The difference
between these models boils down to mathematical technicalities that have
no bearing on the substantive results. In Stata, probit models are estimated
with the command “probit y x1 x2”. A rule of thumb says that logit
coefficients often are about 1.7 times larger than probit coefficients (Long and
Freese, 2014). A third alternative is the so-​called scobit (see Nagler, 1994).
Sequential choices. Some choices are sequential. This means that one first
chooses between option A or B and then, conditional on choosing A, chooses
between alternative (i) or (ii). In contrast, if one chose B first, the next choice
could be a no-​choice or the choice between (iii) or (iv). Two examples: (a)
You first decide to go on vacation (yes/​no) and, once that decision is made,
you choose between, say, travelling abroad or travelling domestic. If you first
decided no, the second choice is a no-​choice. (b) You first decide between
public or private transportation and, given the public choice, choose between
bus or train. If you choose private first, the second choice could be between a
car or a bicycle or a motorcycle. The point from a statistical point of view is
that the second choice is made by fewer people than the first, and this makes
model estimation more complicated. Train (2009) has more to offer on this.12
The assumptions of logistic (logit) regression analysis. The assumptions for
linear regression covered in Chapter 5 do in the main also apply for logistic
(logit) regression analysis. Note, however, that heteroscedasticity is an even
more severe problem in the logistic case.
Mediation models in logistic regression. Mediation may also come into play in
logistic regression settings, but unfortunately such models are not as straightforward
to estimate as in the linear regression case. Breen et al. (2013) is the starting
point for the study of mediation in logistic regression analysis.

Further reading
Long and Freese (2014) and Train (2009) are already mentioned in this
regard. Other works on how to estimate regression models for categorical/​
non-​numerical y-​variables include Long (1997), Hosmer et al. (2013) and Best
and Wolf (2015).

Notes
1 Binary logistic (logit) regression models are often couched in terms of volun-
tary choices between A and B. However, since the voluntariness in many of these
choices could be questioned, quotation marks are sometimes in order. Long (1997)
offers more on the various theoretical motivations for regression models on cat-
egorical y-​variables.
2 Some questions are direct yes or no questions, i.e. 1 versus 0. Others may be recoded
into one, such as voting democrat (1) or republican (0) and travelling abroad during



vacation (1) or travelling domestic (0). What we label 1 and 0 depends on the
research question at hand, but it is most common to denote yes as 1 and no as 0.
3 In linear regression we are concerned about what happens to the conditional mean
of y when x1 (or x2 or…) increases by one unit. This calls for t-​values. In logistic
regression we similarly examine what happens to the conditional proportion for
which y = 1. This, without going into details, calls for z-​values. As regards signifi-
cance levels (p), this amounts to the same.
4 A third way to describe how x1 affects y in logistic regression, popular in some
areas of research, is by means of so-​called odds ratios. I am no advocate for this
approach since I find the terms “odds” and “odds divided by odds” (i.e. odds
ratios) to be everything but transparent. I am not alone in this assessment; see for
example Long and Freese (2014).
5 The present marginal effect is the “average marginal effect” (AME). There are
other kinds of marginal effects, such as the marginal effect at the mean of all (con-
trol) variables (MEM). See Long and Freese (2014) for more details on this.
6 The marginal effects at the mean (MEM; cf. note 5) are very close to the AME-​
effects reported in the main text: 0.0327 (hours of exercise) and 0.1791 (fitness
center membership).
7 Since the natural logarithm (i.e. Ln on a calculator) is used on the left-​hand side
of the logistic regression, we must use the antilogarithm (i.e. e^x on a calculator) to back-transform into probabilities (e.g. Ln 5 = 1.6094; e^1.6094 = 5).
8 When, say, gender is coded 1 for women and 0 for men, the mean of gender
becomes 0.4 when there are 40% women in the data. Yet in real life students are
either women (1) or men (0), so setting gender to 0.4 makes little sense. An option
more in sync with reality is setting gender to either 1 (women) or 0 (men). This
choice, however, might affect the magnitude of the effects of the other independent
variables in the model, due to the non-​linearity inherent in the logistic model
(Long and Freese, 2014). This is not the same as an interaction effect, which must
be examined in the usual way by adding an interaction term to the model (see
section 7.5). The same reasoning applies for other dummy variables.
9 Some caution is required when making this inference; see Ai and Norton (2003)
for a strategy to become more sure in this regard.
10 The technicalities are covered in detail by Long (1997) and by Train (2009).
11 The command “base(0)” ensures this, whereas “base(2)” and “base(1)” will,
respectively, put the fixed fee and the mixed pay form in the reference group. This
changes the values of the coefficients but not how the overall results of the model
should be interpreted.
12 More complicated models include aspects of the alternatives as independent
variables, such as for example the price or travel time of the different alternatives
in the transportation example. Again, see Train (2009) for more on this.

Exercises with data and solutions


Exercise 7A
Use the data stud_​tourism.dta. A dummy in these data is called “more_​
angst” and refers to whether or not a student answers yes (1) or no (0) to the
question “Due to the fear of terror attacks, are you more anxious than before
about traveling to a European big city?”. Another dummy is gender, coded
1 for females and 0 for males. The third variable of interest, num_​trip, is the
answer to the question “How many trips have you made to big cities in Europe
during the last two years?”.



7A1: Estimate the following logit regression model:
more_angst = gender + num_trip
What does this multiple logistic regression tell you?
7A2: What is the predicted probability of being more anxious about trav-
eling than before for a female student having taken three trips to big cities in
Europe during the last two years? And what is the “similar” probability for a
male student?
7A3: What is the predicted probability of being more anxious about traveling
than before for a female student having taken one trip to big cities in Europe
during the last two years? And what is the “similar” probability for a female
student having taken five trips?
7A4: Is there an interaction effect between gender and num_​trip, indi-
cating that travel experience affects more_​angst differently for males and
females?

Exercise 7A, solution


7A1:
The logit regression coefficient for gender is 0.5597, suggesting that female
students are significantly more likely (p value is 0.007) to report being more
anxious about traveling than male ones. The logit regression coefficient for
num_​trip is –​0.1175, implying that students with more travel experience are
significantly less likely (p value is 0.003) to report being more anxious about
traveling than students with less travel experience. However, these coefficients
tell us nothing about the magnitude of these effects.
7A2:
The predicted probability for females is 38.5%; the similar predicted prob-
ability for males is 26.4%. In other words, females have a 12 percentage points
larger probability of being more anxious than “similar” males with respect to
travel experience.
7A3:
The first predicted probability is about 44%; the second is about 33%. More
travel experience makes female students less anxious about traveling (dynam-
ically interpreted).
7A4:
Since the coefficient for the interaction term is small (–0.0120) and insignificant (p
value is 0.889), the answer is no.
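The commands below sketch one way to reproduce the answers above in Stata (the factor-variable notation i.gender##c.num_trip in the last line is just a shortcut for adding the product term used in 7A4).

use stud_tourism.dta, clear
logit more_angst i.gender num_trip
margins, at(gender=(0 1) num_trip=3)     // 7A2
margins, at(gender=1 num_trip=(1 5))     // 7A3
logit more_angst i.gender##c.num_trip    // 7A4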


8 An ordered (ordinal) dependent variable
Logistic (logit) regression

8.1 Introduction
Regression models for numerical, binary and unordered categorical y-​variables
are now among the tools in your regression kit. This leaves the ordered or
ordinal y-​variables, mentioned in the appendix to Chapter 1. An ordered vari-
able, as the name implies, means that there is an ordering among its possible
values/​answer categories. A case in point is the Likert scale, where the answer
categories are (coding in parentheses) strongly disagree (1), disagree (2), nei-
ther disagree nor agree (3), agree (4) or strongly agree (5). A person agreeing
(4) is more in agreement than a person disagreeing (2). But it makes little sense
claiming that s/​he is twice as much in agreement, although 4 is twice as much
as 2. The reason is that the assignment of values from 1 to 5 is arbitrary. To see
this, what if 0 to 4 is used as coding instead? Then agreeing (3) is three times as
much in agreement as disagreeing (1). In the words of Long and Freese (2014,
p. 309), “…the distances between the categories are unknown” for ordered
variables. In contrast, for numerical variables, these distances are known.
Other typical ordered y-​variables in the social and economic sciences are
answers to questions like:

- “How satisfied are you with product X on a scale from 1 to 7, where 1 is very dissatisfied and 7 is very satisfied?” _____ Write your number here
- “How likely is it that you will try product X again during the next year, where 1 is not likely at all and 7 is very likely?” _____ Write your number here
- “How is your general health status?” _____ very poor _____ poor _____ ok _____ good _____ very good

The unknown distance between the values/​ answer categories of ordered


(ordinal) variables makes linear regression a less than optimal modeling
strategy, and the more appropriate procedure is the ordinal regression model.
This is the topic of section 8.2. (If you skipped Chapter 7 because the y in
your current project is ordinal, I encourage you to skim through it before



taking on this chapter.) Section 8.3 takes up the interpretation and communi-
cation aspect of the ordinal regression model, where the challenges resemble
those of the binary y covered in section 7.4. Section 8.4 introduces the pro-
portional odds assumption, non-​linearity and interaction effects. In section
8.5 some alternatives to traditional ordinal regression models are introduced,
whereas section 8.6 takes up when it might be appropriate to use linear regres-
sion on an ordered y. Section 8.7 closes with some related topics and offers
some further readings.

8.2 The multiple ordinal (logit) regression model1


Scenario. Someone claims female students struggle more than male students
to find the motivation to exercise. You decide to examine this by administering a questionnaire to a random sample of students. The data are called
exe_​motiv_​stud. The dependent variable (exe_​motiv) is a Likert scale
with respect to the statement “I find it easy to get the motivation to exercise”.
The distribution of answers (coded 1 to 5) and descriptive statistics for this variable are shown in Stata-output 8.1.

Stata-​output 8.1 Descriptive statistics for the ordinal variable exercise motive (exe_​
motiv).

. use "C:\Users\700245\Desktop\temp_myd\exe_motiv_stud.dta"

. tab exe_motiv

exe_motiv | Freq. Percent Cum.


------------------+-----------------------------------
strongly disagree | 70 13.23 13.23
disagree | 95 17.96 31.19
neither/nor | 142 26.84 58.03
agree | 141 26.65 84.69
strongly agree | 81 15.31 100.00
------------------+-----------------------------------
Total | 529 100.00

. sum exe_motiv

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
exe_motiv | 529 3.128544 1.254733 1 5



In the upper part of the output we see that the two most frequent answers
are “neither disagree nor agree” (coded 3) and “agree” (coded 4); 53% of the
students report this (26.84 + 26.65 = 53.49). The lower part of the output
suggests, if we temporarily view this variable as numerical, that the average
student is slightly on the agreement side of the statement with a mean
of 3.13.



Research question. Do female students, compared to male ones, more often
tend to disagree with the exercise motive statement? Or, conversely, do male
students tend to agree more often with this statement than female ones?2 This
question is examined in Stata-​output 8.2, where the independent variable sub-
jective health status (health) is included as a control. This latter ordinal
x-​variable is coded very poor (0), poor (1), ok (2), good (3) and very good (4).
Stata-​output 8.2 Ordinal logit regression of exercise motivation on gender (gender)
and health status (health).

. ologit exe_motiv i.gender health

Iteration 0: log likelihood = -829.88694


Iteration 1: log likelihood = -758.1889
Iteration 2: log likelihood = -756.81286
Iteration 3: log likelihood = -756.80845
Iteration 4: log likelihood = -756.80845

Ordered logistic regression Number of obs = 529


LR chi2(2) = 146.16
Prob > chi2 = 0.0000
Log likelihood = -756.80845 Pseudo R2 = 0.0881

------------------------------------------------------------------------------
exe_motiv | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | -.7043771 .1678889 -4.20 0.000 -1.033433 -.375321
health | 1.20609 .1143761 10.54 0.000 .9819175 1.430263
-------------+----------------------------------------------------------------
/cut1 | .5650401 .3256599 -.0732415 1.203322
/cut2 | 1.839089 .330419 1.19148 2.486698
/cut3 | 3.197755 .3505133 2.510762 3.884749
/cut4 | 4.827386 .37901 4.08454 5.570232
------------------------------------------------------------------------------

The only things up for interpretation in the output are the signs and the significance levels of the regression coefficients, as with the binary logit model in
section 7.3. Both are significant. The negative sign for gender (female = 1,
male = 0) implies that female students in general report lower probabilities
of agreeing (4) or strongly agreeing (5) with the exercise statement than male
students. That is, male students find it easier to get the motivation to exer-
cise than female ones –​in keeping with the expectation. The positive sign for
the health variable suggests that students with good or very good health are
more likely to agree or strongly agree with the exercise motive than students
reporting poorer health. Finally, the model has four constants, i.e. /​cut1 and
so on, which in practice are overlooked.3
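For the interested reader, the role of the cut points can be sketched as follows (this is the parameterization Stata uses; there is no separate constant, since the cut points play that part). The model implies

Pr(y ≤ j) = e^(cut_j − (b1x1 + b2x2)) / (1 + e^(cut_j − (b1x1 + b2x2))),

and the probability of a particular answer category j is the difference between two adjacent cumulative probabilities, Pr(y ≤ j) − Pr(y ≤ j − 1).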
The reasons why logistic coefficients have no transparent metric were
explicated in section 7.3. We therefore turn directly to the interpretation and
communication aspect of the ordinal logit model.


8.3 Interpreting and presenting the results of the ordinal (logit) regression model
There are various ways to make ordinal regression results interpretable in
terms of exactly how the independent variables of interest affect the ordered
(ordinal) y. The challenge, since there are five answer categories for the
dependent variable (in the present case), is to convey as much information
as possible with the smallest amount of numbers. The predicted probability
approach is one viable means in this regard; another approach is the use of
marginal effects. Both are shown below.
Predicted probabilities. In the present case the research question involves
gender comparisons for an ordered exercise motivation variable, controlling
for health. Stata-​output 8.3 shows the command and results in this respect,
setting the subjective health variable (health) to its mean.4

Stata-output 8.3 Predicted probabilities for exercise motive answers by gender, based on the ordinal logit regression in Stata-output 8.2.

. mtable, at(gender=(0 1)) atmeans noci norownumbers

Expression: Pr(exe_motiv), predict(outcome())

gender strongly_disagree disagree neither/nor agree strongly_agree


----------------------------------------------------------------------------
0 0.064 0.133 0.291 0.342 0.171
1 0.122 0.209 0.327 0.249 0.092

Specified values of covariates

| health
----------+---------
Current | 2.69

The output suggests that 9% of the female students strongly agree with the statement, whereas the analogous figure for male students is 17%, an eight-percentage-point difference. For the agree option the difference is almost ten percentage points. This confirms what we already know, i.e. that male students more easily find the motivation to exercise than female ones, ceteris paribus. For the neither/nor option there is gender parity, whereas female students more often than male ones disagree with the motivation statement (21 versus 13%). Finally, female students are twice as likely as their male counterparts to strongly disagree with the motivation statement (12 versus 6%).
The gender-​specific probabilities above could also be expanded to, say,
students with ok health (coded 2) and students with very good health (coded
4). Stata-​output 8.4 displays the command and results.



Stata-​output 8.4 Predicted probabilities for exercise motive answers by gender and
health status, based on the ordinal logit regression in Stata-​output 8.2.

. mtable, at(gender=(0 1) health=(2 4)) atmeans noci norownumbers

Expression: Pr(exe_motiv), predict(outcome())

gender health strongly_disagree disagree neither/nor agree strongly_agree


--------------------------------------------------------------------------------------
0 2 0.136 0.224 0.326 0.231 0.082
0 4 0.014 0.034 0.116 0.336 0.499
1 2 0.242 0.291 0.283 0.142 0.042
1 4 0.028 0.065 0.192 0.385 0.330

Specified values where .n indicates no values specified with at()

| No at()
----------+---------
Current | .n

The predicted probability of a male student (0) with ok health (2) strongly
agreeing with the exercise motivation statement is 8%; the similar probability
for a male student with very good health (4) is 50%. For female students the
corresponding probabilities are 4 and 33%. Again, we note that male students
and students with (very) good health are most likely to (strongly) agree with
the exercise motive. Female students and students with poor health dominate
at the other end of the exercise motivation scale.
The extension to more complex models involving several controls is
straightforward. What requires deliberate thinking is what to do with the
controls, i.e. what values they should be given. A typical choice for numerical
(and some ordered) variables is the mean or median, but the mean generally
makes little sense for dummies (see note 8 in Chapter 7).
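To illustrate, a hedged sketch of such a command, run immediately after re-estimating the model in Stata-output 8.2, might look like this (the chosen values are of course up to the analyst):

. mtable, at(gender=1 health=(0 1 2 3 4)) noci norownumbers

Here the gender dummy is fixed at female (1) while health varies across its full range; adding the atmeans option, as in Stata-outputs 8.3 and 8.4, would set any remaining covariates in a larger model to their means.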
Marginal effects. The benefit of using the marginal effect approach is its
ease of interpretation; it tells what happens to y in percentage points when
x increases by one unit, holding controls constant at some predefined, fixed
level. For an ordered y, however, “what happens to y” could mean sev-
eral things. Stata-​output 8.5 shows the marginal effect of health status on
the exercise motive if the (control) variable gender is set to male students.
Analogously, Stata-​output 8.6 shows the “same” marginal effect if gender is
set to female students.

Stata-output 8.5 Marginal effect of health status (health) on exercise motive, based on the ordinal logit regression in Stata-output 8.2.

. mchange health, at(gender=0) amount(marginal)

ologit: Changes in Pr(y) | Number of obs = 529

Expression: Pr(exe_motiv), predict(outcome())

| strongl~e disagree neither~r agree strongl~e


-------------+-------------------------------------------------------
health |
Marginal | -0.088 -0.097 -0.068 0.079 0.175
p-value | 0.000 0.000 0.000 0.000 0.000

Average predictions

| strongl~e disagree neither~r agree strongl~e


-------------+-------------------------------------------------------
Pr(y|base) | 0.088 0.143 0.257 0.306 0.206



If health status increases by one unit, say from ok (2) to good (3), it
becomes almost 9 percentage points less likely that a student strongly dis-
agrees with the motive statement, allowing for a dynamic interpretation. In
contrast, it becomes 17.5 percentage points more likely that a student strongly
agrees with the motive statement.

Stata-output 8.6 Marginal effect of health status (health) on exercise motive, based on the ordinal logit regression in Stata-output 8.2.

. mchange health, at(gender=1) amount(marginal)

ologit: Changes in Pr(y) | Number of obs = 529

Expression: Pr(exe_motiv), predict(outcome())

| strongl~e disagree neither~r agree strongl~e


-------------+-------------------------------------------------------
health |
Marginal | -0.139 -0.095 -0.005 0.121 0.118
p-value | 0.000 0.000 0.647 0.000 0.000

Average predictions

| strongl~e disagree neither~r agree strongl~e


-------------+-------------------------------------------------------
Pr(y|base) | 0.154 0.200 0.279 0.246 0.121

In Stata-output 8.6 we note that the same one-unit "health increase" entails a 14-percentage-point smaller probability of strongly disagreeing and a 12-percentage-point larger probability of strongly agreeing. That health affects the exercise motive differently depending on whether gender is set to males or to females is a feature of the non-linearity inherent in the logistic regression model. It should therefore not be interpreted as an interaction effect (see note 8 in Chapter 7).

8.4 The proportional odds assumption, non-linearity and interaction effects
The proportional odds assumption. The ordinal logistic (logit) model in
Stata-​output 8.2 is, on closer inspection, four binary logit models estimated
simultaneously (cf. note 3). In this regard the so-​called proportional odds
assumption, which is a special feature of the ordinal logit model, holds that
these binary logit coefficients are close in magnitude (see Long and Freese,
2014, p. 328). This is typically examined with the Brant test; Stata-​output 8.7
shows the result.5



Stata-​output 8.7 Test of the proportional odds assumption for the ordinal logit regres-
sion in Stata-​output 8.2.

. brant

Brant test of parallel regression assumption

| chi2 p>chi2 df
-------------+------------------------------
All | 7.93 0.243 6
-------------+------------------------------
1.gender | 5.56 0.135 3
health | 2.22 0.528 3

A significant test statistic provides evidence that the parallel regression assumption has been violated.

An insignificant test result, as above, indicates that the proportional odds assumption has not been violated – neither in total nor for any of the individual independent variables in the model. When the assumption is violated, other estimation methods might be considered (see section 8.5).
Non-​linearity. Non-​linearity is addressed by using quadratic terms or by
sets of dummies, as in the binary dependent variable case (section 7.5). The
upcoming example illustrates the dummy variable approach using the same
data as above. The dependent variable is students’ quality of life, i.e. the
answer to the question “How is your general quality of life at present?”. The
distribution for the answers appears in Stata-​output 8.8. The variable has four
answer categories, and most students assess their quality of life as either good
(59%) or ok (25%).

Stata-​output 8.8 Descriptive statistics for the ordinal variable quality of life (qual_​
life).

. tab qual_life

qual_life | Freq. Percent Cum.


---------------+-----------------------------------
very poor/poor | 17 3.21 3.21
ok | 131 24.76 27.98
good | 313 59.17 87.15
very good | 68 12.85 100.00
---------------+-----------------------------------

The independent variables in the ordinal regression model are gender (as before), the students' health status (as before) and their financial status (econ), coded very poor (0), poor (1), ok (2), good (3) and very good (4). The results of the ordinal regression assuming linear effects are displayed in
Stata-output 8.9. Note that there is no gender difference in the quality of life assessments, i.e. the insignificant gender coefficient. In contrast, better health and a better financial situation appear beneficial for quality of life. That is, students who are well off in terms of health and finances tend to report higher levels of quality of life.

Stata-output 8.9 Ordinal logit regression of quality of life assessment on gender (gender), health status (health) and financial status (econ).

. ologit qual_life i.gender health econ

Iteration 0: log likelihood = -545.04928


Iteration 1: log likelihood = -483.17129
Iteration 2: log likelihood = -480.17368
Iteration 3: log likelihood = -480.16767
Iteration 4: log likelihood = -480.16767

Ordered logistic regression Number of obs = 529


LR chi2(3) = 129.76
Prob > chi2 = 0.0000
Log likelihood = -480.16767 Pseudo R2 = 0.1190

------------------------------------------------------------------------------
qual_life | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | -.0648673 .185848 -0.35 0.727 -.4291227 .2993881
health | 1.139609 .1253375 9.09 0.000 .8939522 1.385266
econ | .5863289 .1085067 5.40 0.000 .3736597 .7989982
-------------+----------------------------------------------------------------
/cut1 | .4619588 .4496377 -.4193148 1.343232
/cut2 | 3.240496 .4291612 2.399356 4.081637
/cut3 | 6.690987 .5135166 5.684513 7.697461
------------------------------------------------------------------------------

There is a taken-for-granted issue in the regression in Stata-output 8.9: It assumes linear effects of both health and econ on quality of life. This might be an oversimplification. In contrast, if these two variables are defined as sets of dummies, this assumption is relaxed and actually tested. Stata-output 8.10 exemplifies this by adding the "i." prefix to both variables.
In Stata-output 8.10 the effects of the health dummies are contrasted with the reference category "very poor", which is not included in the regression. Note that the effects increase in size, as linearity would have it. The same linear pattern, though less clear-cut, is observed for the financial status dummies. In this sense the more parsimonious "linear-effect" model in Stata-output 8.9 is preferable.
It bears repeating that the coefficients in Stata-​outputs 8.9 and 8.10 must be
transformed into predicted probabilities and/​or marginal effects to be actually
interpretable as effect sizes (see section 8.3).
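As a minimal sketch of that transformation for the model in Stata-output 8.9, assuming the spost13_ado commands from note 4 are installed, one might write:

. ologit qual_life i.gender health econ
. mchange health econ, amount(marginal)
. mtable, at(econ=(0 4)) atmeans noci

Here mchange reports how a one-unit increase in health or econ shifts the probability of each answer category, and mtable contrasts the predicted probabilities for the lowest and highest values of econ, with the other covariates held at their means.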



Stata-​output 8.10 The regression model in Stata-​output 8.9 estimated by gender and
sets of dummies for health and econ.

. ologit qual_life i.gender i.health i.econ

Iteration 0: log likelihood = -545.04928


Iteration 1: log likelihood = -483.65463
Iteration 2: log likelihood = -479.0489
Iteration 3: log likelihood = -478.93768
Iteration 4: log likelihood = -478.93761
Iteration 5: log likelihood = -478.93761

Ordered logistic regression Number of obs = 529


LR chi2(9) = 132.22
Prob > chi2 = 0.0000
Log likelihood = -478.93761 Pseudo R2 = 0.1213

------------------------------------------------------------------------------
qual_life | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | -.0337418 .1885801 -0.18 0.858 -.4033521 .3358684
|
health |
poor | 1.803271 1.055825 1.71 0.088 -.2661076 3.87265
ok | 2.787032 .9830605 2.84 0.005 .8602683 4.713795
good | 3.78791 .9862848 3.84 0.000 1.854827 5.720992
very good | 5.165589 1.010412 5.11 0.000 3.185217 7.145961
|
econ |
poor | .9653349 .6141632 1.57 0.116 -.2384029 2.169073
ok | 1.344201 .5608862 2.40 0.017 .2448839 2.443517
good | 2.068292 .5727161 3.61 0.000 .9457894 3.190795
very good | 2.448862 .6456987 3.79 0.000 1.183316 3.714408
-------------+----------------------------------------------------------------
/cut1 | 1.161546 1.068282 -.9322472 3.25534
/cut2 | 3.945096 1.091726 1.805353 6.084839
/cut3 | 7.410376 1.116338 5.222395 9.598358
------------------------------------------------------------------------------

Interaction effects. Interaction effects in ordinal regression models are created and interpreted in the same way as for a binary y (section 7.5) or a numerical y (section 6.2), i.e. by creating an interaction variable (term) and then adding it to the additive regression model. If the interaction term has a significant effect, we prefer the interaction (multiplicative) model. If not, we keep the simpler, additive model.
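A minimal sketch for the quality of life model in Stata-output 8.9, with an illustrative interaction between gender and health (the variable name gender_health is mine), could be:

* create the interaction term and add it to the additive model
. generate gender_health = gender*health
. ologit qual_life i.gender health econ gender_health

Whether the gender_health coefficient is significant then decides between the multiplicative and the additive model.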

8.5 Alternative modeling strategies


Sometimes the proportional odds assumption is violated. This might suggest that the ordinal regression model is not the ideal choice. For the model in Stata-
output 8.9, the Brant test suggests that the proportional odds assumption is
violated for gender (p < 0.05; not shown).6 A less restrictive model, assuming
no ordering among the answer categories, is the multinomial regression model



covered in section 7.6. Stata-​output 8.11 shows the command and the results
of the multinomial logit regression model using the good quality of life alter-
native (i.e. the most typical answer) as the baseline or reference.7

Stata-output 8.11 Multinomial logit regression of quality of life assessment on gender (gender), health status (health) and financial status (econ).

. mlogit qual_life i.gender econ health, base(3)

Iteration 0: log likelihood = -545.04928


Iteration 1: log likelihood = -483.96165
Iteration 2: log likelihood = -476.95494
Iteration 3: log likelihood = -476.90082
Iteration 4: log likelihood = -476.90081

Multinomial logistic regression Number of obs = 529


LR chi2(9) = 136.30
Prob > chi2 = 0.0000
Log likelihood = -476.90081 Pseudo R2 = 0.1250

--------------------------------------------------------------------------------
qual_life | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
very_poor_poor |
gender |
female | -.7787443 .5213847 -1.49 0.135 -1.80064 .243151
econ | -.9997106 .2868687 -3.48 0.000 -1.561963 -.4374583
health | -1.246154 .3539768 -3.52 0.000 -1.939936 -.5523724
_cons | 2.654674 1.028803 2.58 0.010 .6382568 4.671091
---------------+----------------------------------------------------------------
ok |
gender |
female | -.2395571 .2325986 -1.03 0.303 -.695442 .2163277
econ | -.4196126 .1346619 -3.12 0.002 -.6835452 -.15568
health | -.9717151 .1579639 -6.15 0.000 -1.281319 -.6621115
_cons | 2.686474 .5378792 4.99 0.000 1.63225 3.740698
---------------+----------------------------------------------------------------
good | (base outcome)
---------------+----------------------------------------------------------------
very_good |
gender |
female | -.6311667 .2849884 -2.21 0.027 -1.189734 -.0725998
econ | .5419019 .1826358 2.97 0.003 .1839422 .8998616
health | .9590341 .2174268 4.41 0.000 .5328854 1.385183
_cons | -5.46392 .8824489 -6.19 0.000 -7.193488 -3.734352
--------------------------------------------------------------------------------

For the ordinal model in Stata-​output 8.9 there was no gender difference in
terms of quality of life assessments among the students. This is also the case
for reporting good quality of life (i.e. base outcome) as opposed to ok or
very poor/​poor quality of life. That is, the two top gender coefficients are both
insignificant. In contrast, female students have a significantly lower probability (p = 0.027) than male students of reporting very good quality of life as opposed to good quality of life. In this sense, using the default ordinal regression
model just because the dependent variable “happens to be ordinal” would
mask a (possibly) important gender difference (see note 7).



The generalized ordered logit/​partial proportional odds model is another
modeling strategy if the proportional odds assumption is violated (see
Williams, 2006). This model, to simplify, strikes a balance between the more
parsimonious ordered logit model and the multinomial logit model. The
model is estimated in Stata with the add-​on command gologit2, which
may be downloaded from the Internet (write “findit gologit2” in the
command window, press Enter and follow the instructions).
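As a tentative sketch, assuming gologit2 has been installed, the command for the quality of life model might read:

. gologit2 qual_life gender health econ, autofit

The autofit option, as I read the gologit2 documentation, relaxes the proportional odds assumption only for those variables that appear to violate it.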
Sequential ordered choices. Some y-​variables are ordered, but in a way that
resembles the steps of a ladder. A case in point is the educational system,
where one has to finish at one level before starting on the next and so on.
Similarly, in karate, the first belt color one gets is white, the second is yellow
and so on with new colors until a selected few finally get the black belt. What
characterizes such processes is that fewer and fewer people stay in the system.
For such processes the sequential logit model is the preferred choice. This
model may be estimated in Stata with the add-​on command seqlogit,
which may be downloaded from the Internet (write “findit seqlogit” in
the command window, press Enter and follow the instructions).

8.6 Using linear regression on ordered (ordinal) y-variables: Is it justified?
Many ordered y-​variables have seven or ten values/​answer categories, such
as the question "How likely is it that you will travel to a big European city
during the next year? (1 = very unlikely; 7 = very likely) _​_​_​_​_​_​_​_​Write your
number here”. The typical choice in the social and economic sciences is to
use linear regression on such y-​variables. Is this wise? My personal take is
that it depends. When lots of people have answered either 1, 2, … or 7 for the
question above, linear regression works well in most cases.8 However, we often
see that 60–​70% answer 1 and 2 or 6 and 7 for variables like the one above.
That is, although the variable is ordinal by definition it comes very close to
being a binary either/​or choice in reality.
The frequency distribution and descriptive statistics for the above variable
(prob_​trav) among a random sample of students are displayed in Stata-​
output 8.12. The data are stud_​tourism.dta. The output shows that
51% are very sure (i.e. answering 7) about traveling, whereas 19% are sure
(i.e. answering 6). The remaining answers from 1 to 5, thus, only account for
30% of the total. In statistics lingo we are talking about a skewed distribution.
The point to notice is that there is much less variation for the y-​variable than
the surface impression created by the 7-​point scale. In reality we have three
groups of potential travelers: the non-​likely (1–​5), the likely (6) and the very
likely (7). And since this “new” variable is ordinal at best, the ordinal logit
model is probably to be preferred.



Stata-​output 8.12 Descriptive statistics for the ordinal variable travel probability
(prob_​trav).

. use "C:\Users\700245\Desktop\temp_myd\stud_tourism.dta"

. tab prob_trav

prob_trav | Freq. Percent Cum.


------------+-----------------------------------
1 | 18 3.54 3.54
2 | 9 1.77 5.31
3 | 29 5.71 11.02
4 | 32 6.30 17.32
5 | 64 12.60 29.92
6 | 95 18.70 48.62
7 | 261 51.38 100.00
------------+-----------------------------------
Total | 508 100.00

. sum prob_trav

Variable | Obs Mean Std. Dev. Min Max


-------------+---------------------------------------------------------
prob_trav | 508 5.84252 1.598447 1 7
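A hedged sketch of such a re-specification, using gender (a dummy in these data) as an illustrative x-variable, could be:

* collapse the 7-point scale into three ordered groups: 1-5, 6 and 7
. recode prob_trav (1/5=0) (6=1) (7=2), gen(trav_group)
. ologit trav_group i.gender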

8.7 Final thoughts and further reading


More can be said about ordered (ordinal) dependent variables, but the above
suffices for most cases in economics and the social sciences –​at least as an
introduction. I close by addressing some points that might come up when analyzing ordered y-variables.
Collapsing an ordered dependent variable into a dummy. A very common
choice in applied work is to overlook that y is ordered, dichotomize it into a
dummy and then use the logit approach discussed in Chapter 7. If we look at
the travel probability variable in Stata-​output 8.12 as an example, this could
mean recoding the values from 1 to 5 into 0 (small probability) and 6 and 7
into 1 (large probability) to get a travel probability dummy. A more typical
case is the dichotomization of a five-​point health status variable (very poor,
poor, ok, good and very good) into poor/​ok (0) or good health (1). It is not
necessarily wrong to do this, but it throws away information in the data that
might be relevant in real life. For this reason, I generally do not recommend
this procedure, although there might be situations in which it is defensible.
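If one nevertheless goes down this route for the travel probability variable, a minimal sketch (again with gender as an illustrative x-variable) would be:

. recode prob_trav (1/5=0) (6/7=1), gen(trav_dummy)
. logit trav_dummy i.gender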
Logit versus probit. There is also a probit alternative to most of the logit
models considered in this chapter, as touched upon in section 7.7. And again
the difference between logit and probit is merely of mathematical interest
since the preference for either option has no consequences for the substantive
results.


Further reading
Long (1997, 2015), Long and Freese (2014), Fullerton (2009) and Greene and
Hensher (2010) are all noteworthy introductions to, and explications of, the
ordinal regression model.

Notes
1 The ordinal (logit) regression model is often traced back to McKelvey and Zavoina
(1975). In biostatistics the model is called the proportional odds model.
2 If linear regression was used to examine this research question, this would involve
scrutinizing how the gender dummy affected the mean of the five-​point Likert scale.
More on this in section 8.6.
3 The technical background for the four constants or cut points is that the ordinal
logit model simultaneously estimates four binary logit models in the present
case: strongly disagree (1) versus all higher values (2 to 5), strongly disagree and
disagree (1 and 2) versus all higher values (3 to 5), strongly disagree and disagree
and neither/​nor (1 to 3) versus all higher values (4 and 5) and strongly disagree and
disagree and neither/​nor and agree (1 to 4) versus all higher values (5).
4 The “mtable” and upcoming “mchange” command (see Stata-​outputs 8.5/​8.6)
are among the suite of commands in the add-​on “spost13_​ado”, available on
the Internet (write “findit spost13_​ ado” in the command window of Stata,
press Enter and follow the instructions). Long and Freese (2014) developed these
commands.
5 This test is among the suite of commands in the spost13_​ado package (see
note 4).
6 Long and Freese (2014) caution against putting too much weight on this test and
related ones.
7 Assuming no ordering among the alternatives of y is always safer than assuming
an order. That is, although a y-variable is ordinal by definition, this does not by
necessity imply that ordinal regression is the most appropriate estimation procedure
(see Long and Freese, 2014).
8 “Lots of people” may of course mean lots of observations/​units. Linear regression
will typically also work fine for a five-​or six-​point scale provided that the distribu-
tion on y is not too skewed. This modeling choice may be justified in the same way
as linear regression (i.e. the LPM) may be justified on a binary y (see section 7.2).

Exercises with data and solutions

Exercise 8A
Use the data stud_​tourism.dta. An ordinal variable in these data called
“safe” is the answer to the question “To what extent do you consider it safe
to travel to a European big city?" (1 = very unsafe; 7 = very safe). A second variable, gender, is a dummy coded 1 for females and 0 for males. The third variable of



interest, num_​trip, is the answer to the question “How many trips have you
done to big cities in Europe during the last two years?”.
8A1: Estimate the following ordered logit regression model:

Safe = gender + num_trip

What does this multiple ordered logistic regression tell you?


8A2: Setting num_​trip at its mean, what is the predicted probability that a
female student finds it very safe to travel (i.e. answering 7)? And what is the
“similar” probability for a male student?
8A3: Is there an interaction effect between gender and num_​trip, indi-
cating that travel experience affects safe differently for the genders?

Exercise 8A, solution


8A1:
The ordered logit coefficient for gender is –​0.8090, suggesting that female
students, as compared to male ones, have lower probabilities of answering
with numbers towards the safer end of the assessment. The effect is significant
(p value is < 0.0001). In other words, male students find it safer to travel than
female ones. The ordered logit coefficient for num_​trip is 0.1174, implying
that students with more travel experience are significantly more likely (p value
is < 0.0001) to answer with numbers towards the safer end of the assessment
than students with less travel experience. As we have seen before in Chapters 7
and 8, however, these coefficients tell us nothing about the magnitude of these
effects.
8A2:
The female probability is 13%; the “similar” male probability is 25%. Men as
expected find it safer to travel than women.
8A3:
Since the coefficient for the interaction term is small (0.0302) and insignificant (p value is 0.619), the answer is no.
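For reference, one possible set of commands that would produce answers of this kind, assuming the spost13_ado suite from note 4 is installed (the interaction variable name gender_trip is mine), is:

* 8A1: the additive ordered logit model
. ologit safe i.gender num_trip
* 8A2: predicted probabilities by gender, with num_trip held at its mean
. mtable, at(gender=(0 1)) atmeans noci norownumbers
* 8A3: create and add the interaction term
. generate gender_trip = gender*num_trip
. ologit safe i.gender num_trip gender_trip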


9 
The quest for a causal effect
Instrumental variable (IV) regression

9.1 Introduction
So far, we have seen several times that the estimate of b1 for x1 on y changes
in size when a new independent variable, x2, is added to the regression. Given
that much quantitative research is about trying to pinpoint exactly how x1 –​ as
measured by the size of b1 –​ affects y holding other relevant control variables
constant, we’re up for a big challenge. How come? As mentioned in Chapter 3,
data limitations invariably put us in a position lacking access to some of these
relevant controls, making our estimate of b1 uncertain. In real life, b1 might
be smaller or it might be larger; we simply do not know. This chapter is about
trying to get the correct –​as in unbiased –​estimate of b1 for x1 on y when we
have reason to believe that our regression model lacks one or more relevant
control variables.
Let’s assume we have a regression model in which b1 for x1 on y is 5, and
that this regression includes all relevant control variables and is otherwise cor-
rectly specified. This implies that no matter how many other controls we add
to the regression, the size of b1 (i.e. 5) will not change: The estimate of b1 is
unbiased. If so, we may tentatively refer to this regression coefficient (or effect)
as the causal effect.1 Finding the causal effect of x1 on y is mostly discussed
in the realm of doing experiments that involve manipulating the x-​variable of
interest; more on this in Chapter 10. In this chapter we aim to pinpoint causal
effects for observational data where we as analysts are passive collectors of
data. The procedure is known as instrumental variable (IV) regression.
IV regression is both simple and hard at the same time. Basically, and
compared with regression in the usual linear manner, we just need one add-
itional independent variable, w. This w is called an instrument or instrument
variable and should meet two requirements: (1) It should be correlated with
x1. (2) It should not affect y, other than indirectly through x1.2 (More on this
in section 9.3.) The hard part is finding instrument variables that actually
meet both requirements.
The rest of Chapter 9 is organized as follows: Section 9.2 introduces
an example of IV regression using data on students. The variable relation-
ship of interest is between exercise behavior and health. Section 9.3 extends



and problematizes the analyses of section 9.2, whereas section 9.4 discusses
scenarios in which x1 and/​or y are dummies. Section 9.5 offers some final
thoughts and further readings on IV regression.

9.2 A first take on the relationship between exercising (x1) and health (y)
A vast number of cross-​sectional studies have documented positive regres-
sion coefficients between exercising and various measures of health. That is,
(more) exercise appears beneficial for (improved) health. But is the b1 for exer-
cise an unbiased causal effect? Skimming through the research, a striking fea-
ture is that a number of relevant controls are left out of the regressions –​e.g.
dietary habits, BMI, health consciousness, medication history, prior exercise
levels during adolescence and so on. For this reason, the b1 for exercise is
probably biased and thus giving it a causal interpretation is dubious.3
Let’s take a look at this relationship using real-​life data from a survey of
students. The data are called iv_​phys_​health.dta. The y is a subjective
assessment of one’s own health, which for simplicity I treat as a numerical
variable with values ranging from 0 (very poor) to 4 (very good). The mean of
this y is 2.79, and the standard deviation is 0.77 (results not shown). The key
independent variable is exercise measured in hours per week (mean = 4.67;
SD = 3.19; results not shown). The linear regression coefficient for the exercise
variable, b1, is shown in Stata-​output 9.1.

Stata-output 9.1 Linear regression of students' health on hours of weekly exercise and controls.

. use "C:\Users\700245\Desktop\temp_myd\iv_phys_health.dta"

. regress health phys_act_h gender youth_sport

Source | SS df MS Number of obs = 676


-------------+---------------------------------- F(3, 672) = 39.50
Model | 60.4993322 3 20.1664441 Prob > F = 0.0000
Residual | 343.090904 672 .510551941 R-squared = 0.1499
-------------+---------------------------------- Adj R-squared = 0.1461
Total | 403.590237 675 .597911462 Root MSE = .71453

------------------------------------------------------------------------------
health | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .0869533 .0089926 9.67 0.000 .0692963 .1046103
gender | .1202736 .0602493 2.00 0.046 .001974 .2385731
youth_sport | .0300321 .0098181 3.06 0.002 .0107543 .0493099
_cons | 2.110347 .0933509 22.61 0.000 1.927052 2.293641
------------------------------------------------------------------------------

The output says that exercise has a positive and significant effect on health,
as to be expected. The b1 of ≈ 0.087 dynamically implies that, say, six more
hours of exercise per week entails 0.52-​point better health on the 0 to 4 scale
(0.087 × 6 = 0.522). The output also suggests that females have slightly better health than males (gender is coded 1 for female and 0 for male). The variable



youth_​sport measures how active the students were in sports during ado-
lescence, with inactive (1) to very active (10) as the range of possible answers.
As expected, this control variable has a positive and significant effect on
students’ health.
Although the regression in Stata-​output 9.1 includes two controls, a number
of relevant controls are omitted from the model; see the list of candidates
three paragraphs above. What will happen to the regression coefficient for
exercising, b1, if these controls are added to the regression? The answer is
one of three things: (1) b1 will remain unchanged at 0.087, (2) b1 will decrease
in size or (3) b1 will increase in size. But a priori we have no way of knowing
which of these alternatives is the more likely, and we have no way of testing
this directly due to data limitations. We thus cannot rule out the possibility
that our b1 is biased, and for this reason (see also note 3) we cannot interpret
it as a causal effect.4
We now turn to IV regression in order to obtain an unbiased b1 for a pos-
sible causal interpretation. To execute an IV regression in the present context,
we need an instrument variable that satisfies two conditions: (1) It should be
correlated with hours of exercise. (2) It should not affect health, other than
indirectly via hours of exercise. Our first candidate is the dummy sport club
membership (sport_​club; yes = 1, no = 0). The IV command in Stata and
the results appear in Stata-​output 9.2.5

Stata-output 9.2 Instrumental variable (IV) regression of students' health on hours of weekly exercise and controls. Sport club membership (sport_club) is the instrument variable.

. ivregress 2sls health gender youth_sport ( phys_act_h = sport_club)

Instrumental variables (2SLS) regression Number of obs = 676


Wald chi2(3) = 31.40
Prob > chi2 = 0.0000
R-squared = 0.0180
Root MSE = .76567

------------------------------------------------------------------------------
health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .178763 .057619 3.10 0.002 .0658319 .2916941
gender | .2332998 .0951796 2.45 0.014 .0467513 .4198483
youth_sport | .0115949 .0155187 0.75 0.455 -.0188213 .0420111
_cons | 1.728247 .2567162 6.73 0.000 1.225093 2.231402
------------------------------------------------------------------------------
Instrumented: phys_act_h
Instruments: gender youth_sport sport_club

The IV regression finds the b1 for exercise to be ≈ 0.179, i.e. about twice as
large as the linear regression b1. (We pay no attention to the effects of other
variables in IV regression; this method is all about obtaining the unbiased


estimate of b1 for the independent variable of key interest.) Note also that the
standard error (SE) of the IV coefficient is much larger than the analogous
SE in the linear regression case; this is a general feature of IV regression (see
Wooldridge, 2000).
Taking stock, the IV regression finds the effect of exercise on health to be
much larger than what the “naïve” linear regression suggests. Is this a tenable
result? We’ll return to this question later; now we take a closer look at the
assumptions that IV regression is built on.

9.3 The assumptions of IV regression: Instrument relevance and instrument exogeneity
To work properly –​that is, to produce an unbiased estimate of b1 –​IV regres-
sion rests on two key assumptions regarding the instrument variable(s): instru-
ment relevance and instrument exogeneity. We address them in turn below.
Instrument relevance. As hinted at, this means that the instrument variable, w,
should correlate with x1. In our case this means that membership in a sport club
should correlate with hours of exercise per week in a positive manner. This is
easy to test: Immediately after the Stata command yielding Stata-​output 9.2,
one writes the command “estat firststage”. This produces the output in
Stata-​output 9.3; the results of the so-​called first-​stage regression.

Stata-​output 9.3 Test of instrument relevance for the instrument variable sport club
membership.

. estat firststage

First-stage regression summary statistics


--------------------------------------------------------------------------
| Adjusted Partial
Variable | R-sq. R-sq. R-sq. F(1,672) Prob > F
-------------+------------------------------------------------------------
phys_act_h | 0.1080 0.1040 0.0280 19.3364 0.0000
--------------------------------------------------------------------------

Minimum eigenvalue statistic = 19.3364

Critical Values # of endogenous regressors: 1


Ho: Instruments are weak # of excluded instruments: 1
---------------------------------------------------------------------
| 5% 10% 20% 30%
2SLS relative bias | (not available)
-----------------------------------+---------------------------------
| 10% 15% 20% 25%
2SLS Size of nominal 5% Wald test | 16.38 8.96 6.66 5.53
LIML Size of nominal 5% Wald test | 16.38 8.96 6.66 5.53
---------------------------------------------------------------------



The R2 of 0.108 is the explained variance for the regression of phys_​
act_​h on sport_​club and the other independent variables in the model
(gender and youth_​sport), implying that they combined account for
11% of the variance in hours of exercise per week. The F-​statistic (or min-
imum eigenvalue statistic) is the benchmark for evaluating whether the instru-
ment variable is "relevant" or not. A rule of thumb says that the F-statistic should exceed 10 for the instrument to count as relevant.6 Since the F-statistic here is 19.33, we have
every reason to call the instrument variable relevant or “strong enough”.
Instrument exogeneity. This assumption, which was also mentioned earlier,
holds that w should not affect (or be correlated with) y other than through
x1. The assumption is not testable empirically in the present case since we
only have one instrument variable. (We return to an indirect test of exogeneity
in the presence of several instrument variables below.) In other words, the
exogeneity assumption rests on theoretical considerations, subject knowledge
and common sense. In this case the question becomes whether or not it is rea-
sonable to argue that sport club membership in itself (i.e. other than through
increased exercise levels) has no direct impact on health. I think this might be
the case, at least tentatively.
More than one instrument variable. Sometimes we have access to data with sev-
eral candidates for instrument variables. In these circumstances it is possible
to test exogeneity indirectly. In the analysis below, the variable membership of
fitness center is added as an instrument; see Stata-​output 9.4.
The upper part of the output repeats the information from Stata-​output
9.2, but now for the case with two instrument variables. Note that the b1 estimate of exercise – 0.123 – is smaller than for the first IV regression (0.179), but
still larger than for the linear regression (0.087). The two instrument variables
are collectively and unsurprisingly relevant or “strong”, with an F-​statistic
of 77.71.
The exogeneity test at the bottom of Stata-output 9.4 was produced by writing the command "estat overid" right after the command "estat firststage". The null hypothesis is that both instruments are exogenous; this cannot be rejected since the p value is much larger than 0.05 (i.e. 0.26). That is, the
instrument variables appear exogenous.7 If we add a third instrument vari-
able to the IV regression –​friends’ exercise levels (friend_​phys) –​ the
b1 estimate of exercise becomes 0.128, the F-​statistic becomes 52.54 and
the p value for the exogeneity test becomes 0.068 (results not shown). In
other words, the IV regression again yields a larger effect than the linear
regression model.
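A sketch of how this three-instrument variant might be run, using the commands and variable names already introduced in this chapter, is:

. ivregress 2sls health gender youth_sport (phys_act_h = sport_club fitness_c friend_phys)
. estat firststage
. estat overid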



Stata-​output 9.4 Instrumental variable (IV) regression of students’ health on hours of
weekly exercise and controls.
Sport club membership (sport_club) and fitness center membership (fitness_c) are instrument variables.

. ivregress 2sls health gender youth_sport (phys_act_h = sport_club fitness_c)

Instrumental variables (2SLS) regression Number of obs = 676


Wald chi2(3) = 59.07
Prob > chi2 = 0.0000
R-squared = 0.1297
Root MSE = .72083

------------------------------------------------------------------------------
health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .122888 .0209191 5.87 0.000 .0818874 .1638886
gender | .1645125 .0650596 2.53 0.011 .0369981 .2920269
youth_sport | .0228157 .0106033 2.15 0.031 .0020336 .0435978
_cons | 1.960791 .1225684 16.00 0.000 1.720562 2.201021
------------------------------------------------------------------------------
Instrumented: phys_act_h
Instruments: gender youth_sport sport_club fitness_c

. estat firststage

First-stage regression summary statistics


--------------------------------------------------------------------------
| Adjusted Partial
Variable | R-sq. R-sq. R-sq. F(2,671) Prob > F
-------------+------------------------------------------------------------
phys_act_h | 0.2549 0.2505 0.1881 77.7107 0.0000
--------------------------------------------------------------------------

Minimum eigenvalue statistic = 77.7107

Critical Values # of endogenous regressors: 1


Ho: Instruments are weak # of excluded instruments: 2
---------------------------------------------------------------------
| 5% 10% 20% 30%
2SLS relative bias | (not available)
-----------------------------------+---------------------------------
| 10% 15% 20% 25%
2SLS Size of nominal 5% Wald test | 19.93 11.59 8.75 7.25
LIML Size of nominal 5% Wald test | 8.68 5.33 4.42 3.92
---------------------------------------------------------------------

. estat overid

Tests of overidentifying restrictions:

Sargan (score) chi2(1) = 1.2464 (p = 0.2642)


Basmann chi2(1) = 1.23946 (p = 0.2656)

Three conclusions emerge from the IV regressions above; two methodological and one substantive: (1) The instrument variables appear valid, but more could probably be said on this issue. (2) The magnitude of the exercise effect, b1, on health depends on the choice of instrument variables.8 (3) The effect of exercise on health appears stronger throughout the IV analyses than



what the linear regression model suggests. The latter is interesting in light
of the study sparking the interest for the present analyses; see Brechot et al.
(2017). As in our case, these authors found a positive and significant rela-
tionship between exercise and various aspects of health using linear regres-
sion. Their IV estimates of this relationship, in contrast, were close to zero
and hence insignificant. Although I find this no-​finding puzzling –​“Are so
many people worldwide really exercising for nothing?” –​I’m not saying that
these authors got it wrong. Neither am I saying the present analyses paint
the true picture of the relationship between exercise and health. But since a
no-​relationship between exercise and health goes against intuition, common
sense and the majority of prior studies, this at the very least opens up a dis-
cussion on the validity of these authors’ choice of instrument variable (see
also section 9.5).

9.4 When x1 and/​or y are dummies


When x1 is a dummy. In the preceding analyses, x1 (i.e. exercise) was a numer-
ical variable; what if x1 is a dummy? An “obvious” choice in this respect would
be to use some kind of logit/​probit model for the first-​stage regression. As it
turns out, however, this is not necessary. That is, one may use the IV regres-
sion procedure illustrated so far in this chapter also when x1 is binary (see
Angrist and Pischke, 2009).
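As a purely hypothetical illustration with the student data, one could treat the sport club dummy itself as the endogenous x1 and use friends' exercise levels as the instrument; the setup, not a recommendation, would simply be:

* hypothetical: a binary endogenous regressor in the usual 2SLS setup
. ivregress 2sls health gender youth_sport (sport_club = friend_phys)
. estat firststage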
When y is a dummy. Sometimes y is binary, as we saw in Chapter 7 with the
logit/​probit models. The built-​in Stata command ivprobit is typically the
preferred procedure for the instrumental variable (IV) equivalent of a binary
y. We continue with the student data, but now the dependent health vari-
able has only two values: less than good (0) or good/​very good (1).9 (66%
of the students report their health to be good/​very good; result not shown).
The IV probit regression with the three instrument variables used earlier in
the chapter is displayed in Stata-​output 9.5.
The IV probit regression coefficient –​which has no intuitive interpretation
as to the magnitude of the effect (cf. Chapter 7) –​is 0.215 and significant,
implying that more hours of weekly exercise entail a larger probability of
reporting good/​very good health as opposed to poorer health.10 The analo-
gous plain vanilla probit regression is reported in Stata-​output 9.6. Note that
the probit coefficient is 0.144, i.e. 33% smaller than the IV probit estimate.
Again, the IV regression suggests that exercise has a stronger effect on health
than what the non-​IV regression suggests.11


Stata-output 9.5 Instrumental variable (IV) probit regression of students' health (dummy) on hours of weekly exercise and controls. Sport club membership (sport_club), fitness center membership (fitness_c) and friends' exercise levels (friend_phys) are instrument variables.

. ivprobit health_good gender youth_sport (phys_act_h=sport_club fitness_c friend)

Fitting exogenous probit model

Iteration 0: log likelihood = -431.42959


Iteration 1: log likelihood = -392.53289
Iteration 2: log likelihood = -392.21573
Iteration 3: log likelihood = -392.21557
Iteration 4: log likelihood = -392.21557

Fitting full model

Iteration 0: log likelihood = -2035.2


Iteration 1: log likelihood = -2035.1218
Iteration 2: log likelihood = -2035.1218

Probit model with endogenous regressors Number of obs = 676


Wald chi2(3) = 55.26
Log likelihood = -2035.1218 Prob > chi2 = 0.0000
-----------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. interval]
-------------------+---------------------------------------------------------------
phys_act_h | .2145108 .0346288 6.19 0.000 .1466395 .282382
gender | .3084178 .1175886 2.62 0.009 .0779483 .5388872
youth_sport | .0287875 .0193683 1.49 0.137 -.0091735 .0667486
_cons | -.9256514 .2118562 -4.37 0.000 -1.340882 -.5104209
-------------------+---------------------------------------------------------------
corr(e.phys_act_h,|
e.health_good)| -.2541837 .1105991 -.455504 -.02813
sd(e.phys_act_h)| 2.749963 .0748033 2.607191 .900553
-----------------------------------------------------------------------------------
Instrumented: phys_act_h
Instruments: gender youth_sport sport_club fitness_c friend_phys
-----------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 4.83 Prob > chi2 = 0.0280

Stata-output 9.6 Probit regression of students' health (dummy) on hours of weekly exercise and controls.

. probit health_good phys_act_h gender youth_sport

Iteration 0: log likelihood = -431.42959


Iteration 1: log likelihood = -394.8526
Iteration 2: log likelihood = -394.55151
Iteration 3: log likelihood = -394.55136
Iteration 4: log likelihood = -394.55136

Probit regression Number of obs = 676


LR chi2(3) = 73.76
Prob > chi2 = 0.0000
Log likelihood = -394.55136 Pseudo R2 = 0.0855

------------------------------------------------------------------------------
health_good | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_act_h | .1444594 .0199195 7.25 0.000 .1054179 .1835009
gender | .2226788 .113964 1.95 0.051 -.0006865 .4460441
youth_sport | .0443555 .0181226 2.45 0.014 .0088359 .0798751
_cons | -.6309021 .1766503 -3.57 0.000 -.9771304 -.2846739
------------------------------------------------------------------------------


9.5 Final thoughts and further reading


The lack of access to certain relevant control variables is the rule rather
than the exception for the observational data typically analyzed in the social
sciences. For this reason, pinpointing the exact and unbiased effect of x1 on y
is a difficult task –​if possible at all. Given certain circumstances, described in
this chapter, one might with a big portion of sound judgement and perhaps
a little bit of luck “arrive at” an unbiased estimate of b1 that may be given a
causal interpretation. The transportation vehicle in this regard is instrumental
variable (IV) regression analysis, and it is first and foremost an extension of
the linear regression model.
IV regression is an obvious statistical tool for scientific investigations revolving around the following schematic research questions using observational data:

• Does involvement in program x (yes/no) entail that program participants score higher/lower on y than non-participants?
• Does involvement in program x (yes/​no) entail that program participants
are more/​less likely to do y than non-​participants?
• Does the intensity of engagement in program/​activity x affect the prob-
ability of doing y and/​or the amount of y?

In these settings –​i.e. when all relevant control variables are most probably
not available in the datasets up for analysis –​IV regression may be a key tool
in the quantitative social scientist’s toolbox. That said, IV regression is by no
means a quick-​fix, as the next paragraph emphasizes.
A cautionary tale. IV regression is neither a mechanistic solution nor a magic
bullet in terms of getting an unbiased b1 for x1 on y using observational data.
On the contrary, the procedure rests on strong assumptions that in most
settings are not empirically testable but rely on sound, theoretical judgement.
A related issue concerns whether or not the IV regression confirms the basic
story told by for example a linear regression. If it does –​or, as in our case
above, amplifies it –​we generally have stronger support for our initial hypoth-
esis. In contrast, when the IV story radically diverges from the linear regres-
sion story, we tend to have a problem…

Further reading
Despite the fact that there are specific IV regression introductions to a number of the sub-disciplines in the social sciences (see below), I recommend Stock and Watson (2007) for every prospective social scientist as the introduction to
IV regression. A more recent introduction is Muller et al. (2015). Other than
this, we have:



For economists in particular: Imbens (2014) and Angrist and Pischke
(2009).
For sociologists/​demographers: Bollen (2012) and Morgan and Winship
(2015).
For political scientists: Sovey and Green (2010).
For epidemiologists: Hernán and Robins (2006) and Martens et al. (2006).
For a general introduction to the cause-​and-​effect (or causality) litera-
ture, see Pearl and Mackenzie (2018) or Rosenbaum (2017).

Notes
1 Causal effects are typically discussed with regard to medical experiments (trials).
Here, one group of subjects gets treatment (e.g. a pill containing real medicine),
whereas the other group gets no treatment (a placebo pill containing no real medi-
cine). Furthermore, the assignment to treatment or placebo is based on a random
procedure; e.g. flip of a coin. This treatment/​placebo variable might be thought of as
a dummy, x1. Two months later, both groups undergo a diagnostic medical test, i.e.
the dependent variable, y. The b1 of this x1-​y regression is the average causal effect
or treatment effect. More complicated designs involve different dosages (placebo,
light, moderate, heavy, very strong), but the same logic holds. I have more to say on
regression analysis for experimental data in the social sciences in section 10.2.
2 Rosenbaum (2017) argues, in my view correctly, that an instrument is the proper
term, but since an instrument variable seems to be the term typically used in the
literature, I’ll stick with the latter.
3 Another reason why the exercise effect on health probably is not causal is that we
cannot in a cross-​sectional context rule out the possibility that health affects exer-
cising; i.e. that people who are healthy for, say, primarily genetic reasons, are more
able to exercise more often. As it turns out, without going into details, IV regression
also tackles this simultaneity problem. See also section 11.4 on this problem.
4 The exercise variable, x1, is called endogenous in econometrics lingo, which means
correlated with y’s (here: health’s) error term, e. In layman speak we would call it
“problematic” in the sense that the estimate of b1 probably is biased. Or, at least,
that this cannot be ruled out.
5 The “2sls” in the command is short for two-​stage least squares estimator; the instru-
ment variable sport_​club is written after the “=” in the parenthesis at the end of
the command.
6 If the F-​statistic is < 10, the instrument is said to be “weak”. This, in general, makes
the IV estimate of b1 less credible. See Murray (2006) for more on weak instruments.
7 The logic behind this test is spelled out in Stock and Watson (2007, p. 443–​445).
It must be emphasized, however, that a non-​significant test result by no means
confirms that the instrument variables are exogenous.
8 This, at first glance, seems to be the rule rather than the exception in the literature,
emphasizing how the quality of the instrument variables is a fundamental requisite for the trustworthiness of IV results.
9 To create this new dummy variable –​i.e. health_​good –​I first used the initial
commands:
gen health_good = health
recode health_good 0/2=0 3/4=1



10 The significant Wald test at the bottom of Stata-​output 9.5 suggests that the exer-
cise variable is indeed endogenous or “problematic” (see note 4). In other words,
the IV procedure is again called for.
11 Using the traditional ivregress command from section 9.2, which Angrist
and Pischke (2009) actually recommend even when y is a dummy, yields an IV
regression coefficient of 0.68 (results not shown). In contrast, the linear regression
coefficient is 0.042 (results not shown). Again, the IV regression gives credence
to a stronger relationship between exercise and health than the non-​IV, linear
regression.


Part 4

Regression purposes, academic regression projects and the way ahead
In Part 4, “Regression purposes, academic regression projects and the way
ahead”, you will get to know:

• The differences and similarities between various purposes of, or motivations for, regression analysis as used in academic settings
• How to organize an academic regression project
• About some practical advice that might come in handy in a regression
project
• About some related extensions of the classical regression models covered
so far in this book


10 
Regression purposes in various
academic settings and how
to perform them

10.1 Introduction
Up until now the focus of this book has been on how regression works, how
to do it on a computer and how to interpret the results that statistics programs
like Stata, R or SPSS produce. This chapter brings together this material and
puts it into the realm of doing and reporting regression analysis in various
academic settings. Section 10.2 presents four different ways of doing regres-
sion analysis according to its main purpose or motivation. In practice, though,
the boundaries between these ways of doing regression are (unfortunately)
often blurred in current research. Section 10.3 presents the typical structure
or setup of a regression-​based study, whereas section 10.4 closes with some
handy advice on how to proceed with a regression-​based academic project.

10.2 Four purposes of regression analysis: (1) prediction, (2) description, (3) making inferences beyond the data and (4) causal analysis
Pressing the point, one might say there are four distinct purposes of, or motiv-
ations for, regression analysis. Three of them are typical in academic settings;
one is not. We start briefly with the latter.
1. Regression as prediction. Bankers are interested in potential customers’
credit score ratings and education admission officers in future students’ aca-
demic success at the university. Credit scores and academic success may both
be operationalized as y-​variables with scores from, say, 0 (poor) to 10 (best).
In a predictive research effort, the goal is to find the subset of x-​variables that
maximizes the statistical explanation of these y-​variables. That is, the goal is to
find the combination of x-​variables that yields the highest R2. The specifics of
the x-variables that fill this purpose, however, are often of secondary interest; the contribution to increased R2 trumps theoretical relevance.1
In section 2.3 we predicted the sale prices of renovated and non-​renovated
houses based on the numbers in the regression equation. In a similar vein,
bankers predict credit score ratings based on their “best” regression model
(typically developed by a trial-and-error procedure) which, in turn, puts
potential customers below or above some threshold value (no loan or loan
granted). The same logic applies in the education admission case, with college
acceptance or not as the end result. Both examples refer to future predictions
based on cross-​sectional data; Pedhazur (1997) has more to offer on regression
as prediction in this respect. Another type of regression prediction involves
trying to say something about future events based on historical data. This is
relevant for the analysis of time-​series data; see section 11.4.
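To make the logic concrete, here is a minimal Stata sketch of a predictive exercise of this kind. The data set and the variable names (credit_score, income, debt_ratio, age, prior_defaults) are hypothetical and serve only as illustration:

* Candidate model 1
regress credit_score income debt_ratio
* Candidate model 2: keep whichever specification yields the higher R2
regress credit_score income debt_ratio age prior_defaults
* Linear predictions for the applicants, to be compared with some threshold value
predict score_hat, xb

In purely predictive work of this sort, the comparison of candidate models is typically repeated until no further x-variable adds noticeably to R2.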
In regression as prediction one starts with the y-​variable and then looks
for x-​variables to predict (variation in) y. For the remaining three purposes of
regression analysis, it is more the other way around: The interest lies, again to
press the point, in how x1 and perhaps x2 and x3 affect some trait or process
which y “happens” to serve as the measurement of. We start with the least
ambitious of these: Regression as (pure) description.
2. Regression as description.2 Without explicitly acknowledging it, regres-
sion was in Chapters 2 and 3 of this book for all intents and purposes used
only to describe relationships between variables in the data at hand. No focus
whatsoever was on what happened beyond the data. Whereas Chapter 2 was
about the unconditional linear relationship between x1 and y, Chapter 3 was
about the conditional linear relationship between x1 and y, where conditional
meant holding x2 and x3 and so on constant. When I say least ambitious, this
should be taken positively in the sense that regression as description is the
least assumption-​demanding regression purpose. That is, a descriptive regres-
sion study has no need for the heavy A1 assumption (i.e. all relevant inde-
pendent variables are included in the regression model; see section 5.2) and
the C assumptions (see section 5.4). A case study illustrating the descriptive
regression purpose follows.
Regression as description: A case study. The faculty members of a Norwegian
university college have answered a questionnaire about gender inequality in
academia. The y-​question reads “To what extent do you think more effort
should be put into policies aiming at obtaining more gender equality in
Norwegian academia?”. The answer categories range from “to a very little
extent" (1) to "to a very large extent" (10). A mean of 7.25 (SD = 2.38) suggests
that most faculty members want more focus on gender-equality enhancing
efforts. Do female and male faculty members feel the same way about this
y-​question? And do “similar” female and male faculty members feel the same
way about it? Table 10.1 answers these questions through linear regression in
a descriptive manner. Model A includes only gender as the independent vari-
able; Model B has gender and some controls as independent variables.
Note that Table 10.1 is a customary way of presenting regression results in
academic settings. Note also that the table is based on Stata-​outputs similar to
those presented in Chapters 2 and 3. For this reason, there is no need to repeat
the Stata-​commands and the Stata-​output.
Before turning to the results of Table 10.1, it is important to note that
the analyses in the table refer to a population and not to a (representative)
sample. For this reason, and since it makes little sense to think of the
university college in question as representative of all such colleges in Norway,
the inferential statistics covered in Chapter 4 and the C assumptions mentioned
in section 5.4 are not necessarily very relevant. In any event, the focus here is
strictly on what descriptively happens in the data at hand.

Table 10.1 Attitude towards future gender-equality enhancing policies in academia by
gender and controls for all faculty members in a Norwegian university college.
Linear regression. N = 162.

Independent variables                                      Model A    Model B

Gender (female = 1, male = 0)                              2.24       1.80
                                                           (0.33)     (0.34)
Opinion on successfulness of gender equality policy (a)               -0.15
                                                                      (0.08)
Have a doctorate (yes = 1, no = 0)                                    -0.70
                                                                      (0.32)
Political orientation: (b)
  Center (= 1)                                                        -0.54
                                                                      (0.36)
  Right-wing or other (= 1)                                           -1.36
                                                                      (0.42)
Number of children below 12 years                                     -0.50
                                                                      (0.17)
Constant                                                   6.31       8.61
R2                                                         0.22       0.42

Note. Standard errors are in parentheses. Model B also controls for department (six dummies).
(a) The question reads "To what extent do you think Norway has had a successful gender equality
policy for academia so far?", with answer categories ranging from "to a very little extent" (1) to
"to a very large extent" (10).
(b) Left-wing political standpoint is the reference.

Model A shows that the mean of the more-​gender-​equality-​effort scale is
6.31 for men (the constant) and 8.55 for women (6.31 + 2.24 = 8.55), or that
women on average score 2.24 points higher than men on this scale from 1
to 10. In other words, female faculty members are more in favor of gender-​
equality enhancing policies than male faculty members.3 The gender variable
accounts for 22% of the variation (variance) in this equality-​effort-​scale; cf.
the R2.
Model A reports the unconditional gender difference in the effort scale.
Model B, in contrast, reports the conditional gender difference, which is 1.80.
This suggests that the controls “explain away” about 20% of the uncondi-
tional gender difference (2.24). However, the (unexplained) gender difference
(1.80) is still substantial. As for the controls, we see that those who view the
former policy on the subject as successful tend to see less need for gender-​
equality enhancing policies, i.e. they have lower scores on the effort scale.
Furthermore, those with doctorates tend to see less need for gender-​equality
enhancing policies than those without. By the same logic, left-wingers are more
in favor of gender-equality enhancing policies than those with other political
views. Finally, faculty members with small children at home are less in favor
of gender-equality enhancing policies than those without, and more so the
more such children they have. The R2 implies that the independent variables
combined explain 42% of the variation (variance) in the equality-​effort scale.
Is the coefficient of 1.80 in Model B the unbiased gender difference for
the y-​variable of interest? This question brings up the A1 assumption, i.e.
whether or not all relevant control variables are included in the regression.
And the answer is probably no, since the model most likely lacks variables that
affect the effort scale and are correlated with gender (e.g. age, type of educa-
tional background, working hours etc.). But as long as we keep a descriptive
purpose –​i.e. a focus on the conditional relationships in the data given the
variables at our disposal –​the A1 assumption is superfluous. In other words,
the gender coefficient is “unbiased” given the control variables in the model
and/​or in the data.4
Regression as description has three merits. First, it is less demanding with
respect to the number of regression assumptions that must be met. (All else
equal, the fewer assumptions that must be met, the fewer opportunities there
are for failing to meet them.) Second, although the purpose is only descrip-
tive, it is still possible to learn a lot from descriptive regressions –​as Models
A and B in Table 10.1 illustrate. Third, although the findings refer to a specific
data set or a certain case, they give rise to hypotheses for other data sets/​cases
of a similar nature.
Summing up, regression as description has a lot going for it. Moreover,
Berk (2004) suggests that regression as description should be the default pur-
pose in the social sciences. I concur. That is, more bachelor theses, master’s
theses and journal articles would benefit from “only” having a descriptive
purpose (see Berk, 2004, pp. 206–​218). With fewer regression assumptions to
meet to begin with, fewer things can go wrong.
3. Regression as making inferences beyond the data. Let’s start with a thought
experiment. Say there are 20 Norwegian university colleges of the kind
analyzed in the previous regression-as-description section. Imagine now
that the 162 faculty members are a representative sample, generated by some
random sampling procedure, from this population of university colleges. If
this is the case, the previous regression table (Table 10.1) will look something
like Table 10.2 in a typical scientific journal in the social sciences/​economics.
The new features in Table 10.2 are the asterisks alongside the regression
coefficients and their significance levels explicated at the bottom of the table.
These are relevant because the data now refer to a representative sample from
a population; hence the need to account for the fact that a regression coeffi-
cient may vary in size in repeated sampling due to randomness.5 With this in
mind, the gender coefficient of 1.80 is significant at the 0.1% significance level
(***). That is, the probability of getting a gender coefficient of 1.80 if the
true coefficient in the population is zero (what H0 suggests) is less than 0.1%.



Table 10.2 Attitude towards future gender-equality enhancing policies in academia by
gender and controls for all faculty members in a Norwegian university college.
Linear regression. N = 162.

Independent variables                                      Model A    Model B

Gender (female = 1, male = 0)                              2.24***    1.80***
                                                           (0.33)     (0.34)
Opinion on successfulness of gender equality policy (a)               -0.15
                                                                      (0.08)
Have a doctorate (yes = 1, no = 0)                                    -0.70*
                                                                      (0.32)
Political orientation: (b)
  Center (= 1)                                                        -0.54
                                                                      (0.36)
  Right-wing or other (= 1)                                           -1.36**
                                                                      (0.42)
Number of children below 12 years                                     -0.50**
                                                                      (0.17)
Constant                                                   6.31       8.61
R2                                                         0.22       0.42

Note. Standard errors are in parentheses. Model B also controls for department (six dummies).
(a) The question reads "To what extent do you think Norway has had a successful gender equality
policy for academia so far?", with answer categories ranging from "to a very little extent" (1) to
"to a very large extent" (10).
(b) Left-wing political standpoint is the reference.
* p < 0.05; ** p < 0.01; *** p < 0.001 (two-sided tests).

(Recall: The t-value is obtained by dividing the regression coefficient by its
standard error. If the resulting t-value is larger than 1.96, the regression coef-
ficient is significant at the 5% level (*) at a minimum in large samples. For the
gender coefficient in Model B, for instance, t = 1.80/0.34 ≈ 5.3.)
The opinion-variable is not significant at the 5% level; its exact p-value (not
shown) is 0.077. Yet if the alternative hypothesis is that the opinion-variable
should have a negative effect (which makes sense theoretically), a one-sided
significance test is justified. Since the one-sided p-value is half the two-sided
one, this makes the opinion-variable's effect significant at p = 0.039, i.e.
below the 5% level. Save for the center political standpoint
dummy, the remaining regression coefficients are significant at conventional
levels.
The bulk of regression studies in the social sciences fall into this regression
as making inferences beyond the data category. For these the C assumptions
(see section 5.4) also come into play.6 Berk (2004, p. 219) nevertheless argues
that significance tests and p values should be used sparingly, since “…in
practice, even well-​designed probability samples are usually implemented
imperfectly, and virtually all regression models are wrong”. A quick glance
through journals in the social sciences and economics will tell you that
his advice has largely been ignored. Personally, I still find Berk's advice sound.



To sum up, regression as making inferences beyond the data rests largely
on how the data or sample was generated. At the least, some random sam-
pling procedure is usually called for at the outset of the data generation pro-
cess if significance tests and p values are to be fully meaningful.7 Also, the C
assumptions must be met. Otherwise this regression purpose or motivation
has lots in common with the pure description strategy.
4. Regression as causal analysis. Chapter 9 introduced IV regression as a pos-
sible means of getting an unbiased regression coefficient that might be given a
causal interpretation. This method, however, rests on strong assumptions to
yield a trustworthy causal effect. Another method, the randomized controlled
trial (RCT), is the gold standard when it comes to revealing a causal effect. The
RCT was mentioned in passing in note 1 to Chapter 9, and here is a recap: 100
patients sharing the same condition to varying degrees get divided into two
groups according to some random procedure like the flip of a coin. Heads
means that a patient gets a pill containing real medicine, whereas tails means
that a patient gets a so-called placebo pill, i.e. no medicine. The ran-
domness of the coin toss ensures about 50 patients in each group, and neither
patients nor doctors know if the pill contains real medicine or a mix of salt and
powder. (This is why the procedure is called double-blind.) The patients take
the pills for a few weeks and then undergo a diagnostic test, i.e. the dependent vari-
able/​y. The medicine/​placebo variable may now be thought of as the dummy
x1, where real medicine = 1 and placebo = 0. Therefore, the b1 for the regres-
sion of y on x1 yields the average causal effect of the medicine or, in general
terms, the treatment effect. Why? The trick or “magic” lies in the random
assignment of medicine versus placebo. Since randomization makes the two
groups (treatment or placebo) similar in every respect on average, the only thing
separating them –​i.e. actual versus fake medicine –​must be the cause of vari-
ation/​difference in y.8
In regression-​speak, randomization under ideal settings ensures that there
is no correlation between x1 and the other independent variables (potentially)
bringing about variation in y. For this reason, if controls were to be added to
the regression, the estimate of b1 will typically not change by much.9 The b1
for x1 on y is thus unbiased and might be given a causal interpretation.
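In Stata terms, the analysis of such an RCT boils down to a very simple regression. The sketch below is hypothetical; the variable names (test_result, medicine, age, female) are made up for illustration:

* b1 for the treatment dummy estimates the average causal (treatment) effect
regress test_result medicine
* If randomization has worked, adding controls should leave b1 roughly unchanged
regress test_result medicine age female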
The RCT design, and experiments in general, have not been common in
the social sciences except in psychology. But times are changing. In economics
and political science, for example, we have seen a flourishing of experimental
studies of late. Here, the treatment versus placebo assignment could be the
joining of a training program or not, joining program A or program B, or
being exposed to scenario/policy A or B and so forth.10 This is the dummy
x1. At some later point in time the y of interest is measured, and the b1 for x1
hopefully captures the average treatment or causal effect. I illustrate the gen-
eral idea with an example below.
Regression as causal analysis. A case study. A number of studies in economics
and marketing have been concerned about what makes motion pictures
financial successes or not.11 The observations for these studies are movies
shown in cinemas/​movie theatres. The dependent variable, y, is movies’ box
office revenues. Typical independent variables include movie genre, movie
nationality, budget, critical acclaim by experts, word-of-mouth (WOM),
PGA-rating etc. Some of these studies have had a predictive purpose, as in
trying to find the combination of x-​variables that maximizes the R2 with
respect to box office revenues. Others, in contrast, have had an explanatory or
causal purpose, i.e. to pinpoint the exact causal effect of a certain x1 on box
office revenues. The latter type of research question is what interests us here.
The fundamental problem in finding the causal effect of, say, WOM on box
office revenues is that the regression model might lack one or more important
control variables, typically for data availability reasons. IV regression could
in principle solve this, but relevant and exogenous instrument variables are
hard to find.12 Another strategy, the one taken here, is to go one step back and
examine whether or not potential movie-​goers are affected by WOM when
making their movie-​seeing decisions.
After an introduction to be described shortly, the respondents in the
upcoming analysis were asked “On a scale from 1 (very unlikely) to 10 (very
likely), how probable is it that you will see the movie?”. This movie-​seeing
intent is the y-​variable. The introduction to this question reads: “You are
alone an afternoon with nothing special to do, and you have 100 NOK
to spend on anything you like. You notice a movie you have heard of is
showing at your local cinema”. (100 NOK is the cost of a movie ticket.)
After this introduction, the questionnaire came in two versions according
to a random procedure.13 One group of respondents was told that a friend
had recommended the movie in question as very worthy of seeing, i.e.
x1 = 1 or WOMtreatment. The other group of respondents was told that a
friend had recommended the movie in question as somewhat worthy of
seeing, i.e. x1 = 0 or WOMplacebo.14 In other words, b1 is the mean diffe-
rence in movie-​seeing intent between a movie somewhat worth seeing and a
movie very much worth seeing. Or, stated differently, it is the difference in
movie-​seeing intent caused by WOM as defined here. The analysis pertains
to students. Panel A in Table 10.3 presents seven bivariate regressions, i.e.
one dependent intent-variable for each movie genre.15 The table, once again
in publication style, is based on Stata-outputs similar to those presented in
Chapters 2 and 3, and for this reason there is no need to repeat the Stata-​
commands and the Stata-​output.
Column AA in Panel A refers to the intent of seeing an action/​adventure
movie. If a friend assessed such a movie as somewhat worthy of seeing, it has a
seeing-​intent of 5.49 on the scale from 1 to 10, i.e. the constant. If the “same”
movie was assessed by a friend as very worthy of seeing, the seeing-​intent is
1.13 points larger –​or 6.62. That is, the treatment or causal effect of WOM
is 1.13 for an action/​adventure movie. Similar effects are obtained for the
other genres, save for the Norwegian movies (p > 0.05). Panel B reports the
analogous WOM-​effects adding control variables; see the note to Table 10.3.



Table 10.3 Intent to see movie in cinema by WOM. Linear regression. N = 226–229.

                    AA(1)    SD(2)    LD(3)     SF(4)    COM(5)   NOR(6)   DOC(7)

Panel A
WOM: (a)
 Movie very         1.13**   1.04**   1.07***   0.84*    1.15***  0.56     1.26***
 worthy of seeing   (0.353)  (0.343)  (0.327)   (0.359)  (0.341)  (0.354)  (0.370)
Constant            5.49     3.36     3.81      4.31     5.71     3.92     2.64
Controls:           No       No       No        No       No       No       No
R2                  0.18     0.12     0.10      0.09     0.13     0.05     0.12

Panel B
WOM: (a)
 Movie very         1.22***  0.99**   1.04***   0.893*   1.16***  0.51     1.21***
 worthy of seeing   (0.348)  (0.342)  (0.317)   (0.358)  (0.339)  (0.356)  (0.371)
Constant            5.63     3.26     5.77      4.36     7.37     4.29     2.64
Controls:           Yes      Yes      Yes       Yes      Yes      Yes      Yes
R2                  0.23     0.15     0.18      0.13     0.17     0.06     0.14

Note. Standard errors are in parentheses. Control variables are gender, age, number of cinema
visits during the previous two months and number of movie-streaming subscriptions.
(a) = Reference: Movie somewhat worthy of seeing.
(1) = Genre: Action and adventure.
(2) = Genre: Serious drama.
(3) = Genre: Light drama/comedy drama.
(4) = Genre: Science-fiction.
(5) = Genre: Comedy.
(6) = Genre: Norwegian movie.
(7) = Genre: Documentary.
* p < 0.05; ** p < 0.01; *** p < 0.001 (two-sided tests).

The regression coefficients for WOM are in the main unaffected by the addition of
controls, exemplifying how the experimental design makes control variables
more or less redundant.16
The analyses above tacitly assume that the causal effects are similar for all
subgroups in the data, such as males and females. To test if this is plausible or
not, one might add interaction variables to the regression models in the usual
manner (see section 6.2). This is often referred to as examining effect hetero-
geneity in the lingo of regression analysis for experimental data.
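A minimal Stata sketch of such a test of effect heterogeneity might look as follows; the variable names (intent, wom, female) are again hypothetical:

* Interaction between the WOM treatment dummy and gender
regress intent i.wom##i.female
* A significant interaction term suggests that the WOM effect differs between men and women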
Regression as causal analysis: Some final thoughts. Finding a causal effect is
the most ambitious social science research endeavor. Whether or not plain
vanilla regression analysis might be used for this purpose depends primarily
on how the data up for analysis came about, i.e. the data generating process.
From a purist stance one must have some sort of manipulation of x1 if we are
even to start talking about causal effects. That is, for some, causal analysis
is for experimental data only. But under such circumstances, as illustrated
above, regression fits the bill perfectly as a statistical tool of choice. This
brings the old adage "correlation is not causation" to the fore. When people
make this claim, they tend to mean that regression cannot be used to prove
causation. The analyses above – as well as regression on experimental data in
general – attest to the contrary. The key point in this respect is that it is first
and foremost the data generating process, and not the statistical tool per se,
that allows or disallows causal inference.

Abstract
1. Introduction
   1.1 The study purpose
   1.2 Theory and prior research
   1.3 Precise research goal/hypothesis
2. Data and methods
   2.1 Presentation of data
   2.2 Presentation of main variables
   2.3 Presentation of regression results
3. Conclusions and implications
   3.1 Study purpose revisited
   3.2 Main findings and conclusions
   3.3 Implications for future research and policy

Figure 10.1 A typical structure for a regression-based academic study.

Despite experimental designs having become more in vogue in the social
sciences of late, they are nevertheless not viable in many situations –​for eth-
ical, practical, political or economic reasons.17 This leaves the quantitative
social scientist with “only” the observational data of this book. Should causal
aspirations be abandoned here? Taking the purist stance on causal analysis,
the answer is certainly yes. A more pragmatic stance, perhaps a tentatively
more positive approach, lies in the use of IV regression or making sure the
linear regression model includes all relevant control variables. Other empir-
ical strategies for obtaining causal effects for observational data are discussed
briefly in sections 11.4 and 11.6.

10.3 The structure of an academic regression-​based study


An academic regression-based study typically finds itself somewhere on
the continuum from having mostly a descriptive purpose to having a tenta-
tive causal inference aspiration. And most often such a study will be better
off if its place on this continuum is more clearly defined. That said, most
regression-​based studies in academic settings have lots of similarities in terms
of how they proceed from the initial research question/​hypothesis to the final
conclusion. This section spells this structure or setup out in detail.



A bachelor’s thesis, a master’s thesis and, in many cases, a journal paper
using regression as its main research tool have a number of common features
in terms of how they are organized as a written document. Figure 10.1
summarizes this, and I comment briefly on it below.18
More on abstract. This is a short summary of the research question/​hypoth-
esis, the main findings and the conclusions. It is, in its final form, written at the
tail end of the research process, i.e. when everything else is finished.
More on 1.1. In this sub-​section we paint the research question/​hypothesis as
well as the background for it in broad strokes. Also, we often state why the
research question/​hypothesis is interesting from an academic and/​or policy per-
spective. As a general rule this sub-​section should be easy to follow no matter
what type of background the reader has. Some prefer to divulge or "sell"
the conclusions of the study in this section; others save this for section 3.19
More on 1.2. The theories and prior research shedding light on the present
research question are presented in this sub-​section. Some kind of structuring
of the material is required, such as dividing theories/​prior studies into groups
according to some criteria or using the time of publication as the ordering
device. As a general writing-​style tip, the upcoming sub-​section 1.3 should
flow “naturally” from this sub-​section.
More on 1.3. The research question/​hypothesis is stated formally and precisely
here, "growing" out of sub-section 1.2. It is also common to provide a road
map for the remaining part of the text here.
More on 2.1. This sub-​section, as the heading implies, describes how the data
up for analysis came about as well as their strengths and weaknesses. A more
thorough description is often necessary if the data are novel, i.e. collected
exclusively (perhaps by you) for the research question/​hypothesis at hand.
More on 2.2. The means, standard deviations and extreme values for the
variables are displayed in this sub-​ section. Furthermore, precise variable
definitions/​wording of questions are offered here, as exemplified a number of
times in this book up until now.
More on 2.3. This is the place for (doing) the regressions and their interpret-
ations, i.e. what this book mainly is about.20 The tables in section 10.2 are
typical in terms of presenting regression results. Graphs (plots) may also be
considered as a supplement to tables.
More on 3.1. The research question/​hypothesis and its justification is repeated
here. It is often good practice that a reader not interested in statistical tech-
nicalities should be able to “jump” from 1.3 to 3.1 without losing his or her
train of thought.
More on 3.2. The main message regarding the results of the regressions and
how these relate to the theory/​prior research is repeated here. If this is thor-
oughly explicated in 2.3, this sub-section is typically brief. In contrast, if the
result presentation in 2.3 uses a bare-​bones strategy, section 3.2 will typically
also include a more exhaustive discussion of the findings.
More on 3.3. What we have learned from the study and what others (e.g.
researchers, policy makers) might learn from it finalizes the project. The study
limitations are also mentioned in this sub-​section.
The structure in Figure 10.1 is not a fixed prescription in the sense that
a regression-based study should look like this. That said, I have learned
that students embarking on their first academic regression trip benefit
from having a road map as a guide. There is also much to be learned
in this respect by studying the structure of earlier academic projects (e.g.
master’s theses, working papers, journal articles etc.) on the topic in question
and related ones.

10.4 Some practical advice


Section 10.2 put the regressions of this book into the wider context of doing
an academic regression project, and showed some examples of how regres-
sion results typically are presented in academic settings. Section 10.3 then
briefly summarized how such regression projects are organized as written
documents. I close with some practical advice on how to actually proceed
with a regression-​based academic project.21
It is often emphasized that theory, prior research and common sense
should guide (i) the choosing of independent (control) variables and (ii) how x
should be related to y when doing regression analysis –​something it is foolish
to argue against. But sometimes, and perhaps more often than we appreciate,
these bases of knowledge take us only so far. To actually cross the finish line in
an academic project, however, we must often build our regression models also
on what exploratory analyses of the data tell us. That is, applied regression
analysis is often a mix of deductive reasoning (from theory to data) on the one
hand and inductive thinking (from data to theory) on the other. Regarding
the latter, here are some useful tips.
Analyze one independent variable at a time! Once the data have been
collected/​downloaded and cleaned, which involves much work we cannot dive
into here (but see Stata’s YouTube channel), it is tempting to jump right at
multiple regression. Don’t! Start instead by inspecting how the key-​interest
independent variable, x1, appears to be related to y. Also, do a “visual regres-
sion” (cf. Figure 1.2) of this x1-​y relationship. (If your data contain more than
500 observations, use your statistics program to draw a random sample of
200 observations, and do the preliminary regressions on this sample.) Check
the linearity assumption (cf. Figure 5.1) to highlight a need for potential data
transformations (e.g. squared terms, logarithms), look for possible influential
outliers (cf. Figure 5.2) and check for potential interaction effects with key
control variables (cf. section 6.2). Repeat this process for x2 and x3 and so on.
Then, and only then, begin the process of building more complex multiple
regression models, possibly involving (more) interaction effects.
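A minimal Stata sketch of this one-variable-at-a-time routine might look as follows; y and x1 are placeholders to be replaced by your own variables:

preserve
sample 200, count                     // random subsample for quick, preliminary exploration
twoway scatter y x1 || lowess y x1    // "visual regression" and a check of the linearity assumption
regress y x1
predict res1, residuals               // inspect the residuals for possible influential outliers
restore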



Don’t do stepwise regression. Stepwise regression is a technique some
textbooks devote attention to. A bit simplified, stepwise regression lets the
statistics program “choose” the independent variables for you, based on the
explanatory power they each have on y. This might sound fabulous (“I don’t
need to think for myself!”), but it is generally not. One of many problems is
that you run the risk of the statistics program throwing out “your” x1 from
the regression –​which of course is problematic given your present research
question or hypothesis. That said, stepwise regression might come in handy
in exploratory work in settings where you have a large pool of candidate
independent variables and little theory/​prior research to guide you. Also, step-
wise regression might be tried out in purely predictive work.
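For the record, a hedged sketch of how stepwise selection is typically invoked in Stata in such purely exploratory or predictive settings (the variables are placeholders):

* Backward selection: variables with p >= 0.2 are dropped one at a time
stepwise, pr(0.2): regress y x1 x2 x3 x4 x5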

Notes
1 When y is binary or ordered, which also makes for prediction, some equivalent
pseudo R2 measure or proportion of correct classifications is typically used.
2 The remaining part of section 10.2 draws heavily on Berk (2004).
3 Without some benchmark it is difficult to say what constitutes a large difference
between two means. Cohen’s d (Cohen, 1988) is typically used as an effect size
measure in this regard. The Cohen’s d for this gender difference is 1.08, which
typically is thought of as a “large” effect in the social sciences (d > 0.80 = large
difference).
4 Provided there is no need for squared terms/​sets of dummies to capture non-​
linearities or product terms to capture interaction effects. The use of logarithms is
not relevant here since the y-​variable has rather few values.
5 If not stated explicitly otherwise, Stata and other statistics programs always
assume by default that the data in question are a random sample from some larger
population. For this reason, they always calculate inferential statistics.
6 The A1 assumption plays pretty much the same role in descriptive studies as in
inference beyond the data studies.
7 The alternative justification for significance testing is so-​called model-​based signifi-
cance testing; see Hoem (2008) or Lumley (2010) for more on this.
8 There is a flavor of Sherlock Holmes-​ logic in play here: If all other causes
(suspects) have been ruled out by logic/​evidence, the remaining cause (suspect)
must be “guilty”, however improbable that sounds. See also Pearl and Mackenzie
(2018) or Rosenbaum (2017).
9 Of course, since settings are never ideal, there might be tiny correlations (positive
or negative) between x1 and the controls. This is typically why b1 might change
somewhat in size when controls are added to the model. Remember: A relevant
control variable affects y and correlates with x1. Yet since the RCT by its design
ensures that the latter correlation is close to zero if all goes well, the A1 assumption
becomes superfluous.
10 More complicated designs involve three treatments or one placebo and two
treatments.
11 References to this literature are found in Thrane (2018).
12 Note also that the units analyzed in this research are movies, and not people
making actual movie-​seeing decisions.



13 The design is called a survey experiment. The two slightly different versions of the
questionnaire were randomly sorted and distributed in several auditoriums and
public areas around campus.
14 Since "not recommended", as in not worthy of seeing, was deemed an artificial pla-
cebo, the somewhat-​option was preferred. An alternative approach could be to
leave out the placebo “friend story” altogether.
15 Since there were few students in some auditoriums/​public areas, all analyses con-
trol for place of interview by a set of dummies.
16 The R2 increases due to the effects of the controls. Also, the standard errors of
the WOM-​coefficients get slightly reduced. Both features are typical when adding
controls in regressions carried out on experimental data.
17 RCTs/​experimental designs also have a number of imperfections, making them no
panacea for identifying causal effects. See Deaton and Cartwright (2018) for more
on this.
18 There are, of course, a number of variations on this setup throughout the eco-
nomic and social sciences. The reader should therefore consult his or her own field
of research before settling for a final structure.
19 The main conclusions are offered along with the research questions in a typical
journal paper in economics. In for example marketing and psychology, in contrast,
the conclusions are more often placed at the tail end of the paper. See also King
(2006).
20 There is typically no need to explicate how plain vanilla linear regression works. In
the case of more complicated methods (e.g. logit, IV regression etc.), however, a
short introduction to how the method in question works is often provided.
21 Kennedy’s practical advice paper is highly recommended in this regard
(Kennedy, 2002).


11 The way ahead: Related techniques

11.1 Introduction
The time has come to look forward towards other, related statistical techniques
that might be handy in a regression analyst’s toolbox. These techniques, it
should be emphasized, are often tailor-​made and more statistically complex
solutions to the idiosyncrasies in the data up for analysis. Yet they are gen-
erally straightforward to implement given a firm grip on the plainer vanilla
regressions offered in this book so far.

11.2 Related technique 1: Median and quantile regression


Linear regression is about what happens to the conditional mean of y if x1
or x2 and so on increases by one unit. But the mean is only one measure of
y’s central tendency; another and a more robust measure in this regard is the
median. The median splits the distribution of y on its middle value, i.e. the 50/​
50-​point. Without going into technicalities (cf. Koenker, 2005), it is “easy” in
Stata or any other sophisticated statistics program to estimate median regres-
sion models (the general command in Stata is: qreg y x1 x2).
In Stata-​output 2.2 we saw the linear regression of house sale prices on house
sizes yielding a regression coefficient of 2 630 €, suggesting that a 101 square
meter house costs 2 630 € more on average than a 100 square meter house.
Stata-​output 11.1 reports the analogous median regression, which roughly is
interpreted in the same way as the linear model. The median regression coef-
ficient is in our case only a bit smaller, i.e. 2 224 €. In other cases, especially
when the mean of y differs from its median (i.e. for skewed distributions), the
results of median and linear regressions might diverge more.
Other cut points for the distribution of y, called quantiles, are the 25/​75-​
point and the 75/​25-​point (or 10/​90 and 90/​10). It is also straightforward to
estimate such quantile regression models; Stata-​output 11.2 illustrates. The
q50 is the median regression (having the same b as before), whereas the q25
and q75 quantile regression coefficients are, respectively, 2 661 € and 2 619 €.
That is, the effects of house sizes on house prices appear constant throughout
the distribution of prices. In other settings this might not be the case.



Stata-​output 11.1 Median regression of house sale prices (price_​Euro) on house
sizes (square_​m).

. use "C:\Users\700245\Desktop\temp_myd\housedata.dta", clear

. qreg price_Euro square_m


Iteration 1: WLS sum of weighted deviations = 3748188.7

Iteration 1: sum of abs. weighted deviations = 3745076


Iteration 2: sum of abs. weighted deviations = 3730835.4
Iteration 3: sum of abs. weighted deviations = 3726460.9

Median regression Number of obs = 100


Raw sum of deviations 5649583 (about 441000)
Min sum of deviations 3726461 Pseudo R2 = 0.3404

------------------------------------------------------------------------------
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
square_m | 2223.958 315.2902 7.05 0.000 1598.275 2849.642
_cons | 134130.2 45949.42 2.92 0.004 42945.08 225315.3
------------------------------------------------------------------------------

Stata-​output 11.2 Quantile (q25, q50, q75) regression of house sale prices (price_​
Euro) on house sizes (square_​m).

. sqreg price_Euro square_m, quantile (.25 .5 .75)


(fitting base model)

Bootstrap replications (20)


----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
....................

Simultaneous quantile regression Number of obs = 100


bootstrap(20) SEs .25 Pseudo R2 = 0.3414
.50 Pseudo R2 = 0.3404
.75 Pseudo R2 = 0.3034

------------------------------------------------------------------------------
| Bootstrap
price_Euro | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
q25 |
square_m | 2661.458 156.0873 17.05 0.000 2351.708 2971.209
_cons | 4374.998 23900.16 0.18 0.855 -43054.1 51804.09
-------------+----------------------------------------------------------------
q50 |
square_m | 2223.958 351.2313 6.33 0.000 1526.951 2920.966
_cons | 134130.2 60473.29 2.22 0.029 14122.91 254137.5
-------------+----------------------------------------------------------------
q75 |
square_m | 2619.048 346.9013 7.55 0.000 1930.633 3307.462
_cons | 132619 49833.23 2.66 0.009 33726.61 231511.4
------------------------------------------------------------------------------

11.3 Related technique 2: Count data regression


In certain settings, your y measures how many times something has happened
during a given period; it is the count of something. Some examples are
number of cigarettes smoked per day, number of times exercising during a
14-day period, number of scientific papers written during a five-year period
and number of goals scored during a soccer season. In other words, y takes on
only whole numbers from 0 to… Such y-​variables may very well be analyzed
with the regression models covered in Chapters 3 and 4 of this book, but it has
become more and more typical, for reasons we do not get into, to use tailor-
made count data regression models in these circumstances.
There are two main types of count data regression models: the Poisson
model and the Negative binomial model. Yet since the Poisson model is more
assumption-​demanding than the Negative binomial (NB) model (cf. Cameron
and Trivedi, 2013), I only consider the latter here. Using the data foot-
ball_​earnings.dta, the dependent variable is number of goals scored
during the season (goals_​season). The independent variables are number
of matches played during the season (games_​season) and the dummies
midfield player (midfield; 1 = yes, 0 = no) and offensive player (striker;
1 = yes, 0 = no). The reference/​constant is thus a defensive player. The NB
model’s results appear in Stata-​output 11.3.

Stata-output 11.3 Number of goals scored in season (goals_season) by independent
variables. Negative binomial (NB) regression model.

. use "C:\Users\700245\Desktop\temp_myd\football_earnings.dta"

. nbreg goals_season games_season midfield striker

Output omitted …

Negative binomial regression Number of obs = 242


LR chi2(3) = 132.15
Dispersion = mean Prob > chi2 = 0.0000
Log likelihood = -419.85712 Pseudo R2 = 0.1360

------------------------------------------------------------------------------
goals_season | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
games_season | .0653718 .0109971 5.94 0.000 .0438178 .0869257
midfield | .9295584 .1661253 5.60 0.000 .6039587 1.255158
striker | 1.827004 .1734234 10.53 0.000 1.487101 2.166908
_cons | -1.632941 .2789111 -5.85 0.000 -2.179596 -1.086285
-------------+----------------------------------------------------------------
/lnalpha | -.7654989 .2286809 -1.213705 -.3172925
-------------+----------------------------------------------------------------
alpha | .4651018 .1063599 .2970944 .7281177
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 63.18 Prob >= chibar2 = 0.000

Since the NB coefficients have little intuitive appeal on their own, the
thing to notice in the output is their sign and significance level. Note that
players with more appearances on the pitch during the season score more
goals, and that midfielders and offensive players score more goals than
defenders. All three results are much as expected, and all three coefficients
are significant.1



To get a firmer grip on the magnitude of the effects, we let Stata transform
the coefficients into marginal effects (as we did in Chapters 7 and 8). Stata-​
output 11.4 presents the results.

Stata-output 11.4 Marginal effects for the NB regression model reported in
Stata-output 11.3.

. mchange

nbreg: Changes in mu | Number of obs = 242

Expression: Predicted number of goals_season, predict()

| Change p-value
-------------+----------------------
games season |
+1 | 0.154 0.000
+SD | 1.526 0.000
Marginal | 0.149 0.000
midfield |
+1 | 3.490 0.001
+SD | 1.261 0.000
Marginal | 2.115 0.000
striker |
+1 | 11.869 0.000
+SD | 2.420 0.000
Marginal | 4.158 0.000

Interpretation: One more match played during the season increases the
number of goals scored by 0.15 on average, all-​else-​equal and interpreted
dynamically. On average, midfielders score 2.12 more goals than defenders
all-​else-​equal, whereas the similar difference between defenders and offensive
players is 4.16 goals.
Excess zeros. One sometimes encounters that the y of interest contains an
“excess” number of zeros, meaning that the zero count is much more fre-
quent (relatively) than positive counts. A case in point is number of cigarettes
smoked in a day. Here, all non-​smokers, typically 80% or more in a “normal”
sample of adults, report zero. (This was partly also the case for the goal-​
scoring model estimated above; 40% of the players scored zero goals during
the season.)
To accommodate excess zeros, we use the zero-​inflated count data regression
models. In these models the independent variables affect two processes: (1)
y = 0 versus y > 0 and (2) amount of y, given y > 0. I use the data football_​
earnings.dta to illustrate again. The y is total number of international
matches played in the career (games_​int; 60% have zero international
matches), and the independent variables are age and performance. Stata-​
output 11.5 displays the results.



Stata-output 11.5 Number of international matches in career (games_int) by
independent variables. Zero-inflated negative binomial regression.

. zinb games_int age performance, inflate(age performance)

Fitting constant-only model:

Output omitted …

Zero-inflated negative binomial regression Number of obs = 242


Nonzero obs = 98
Zero obs = 144

Inflation model = logit LR chi2(2) = 6.86


Log likelihood = -467.3308 Prob > chi2 = 0.0324

------------------------------------------------------------------------------
games_int | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
games_int |
age | .0681041 .0311644 2.19 0.029 .007023 .1291852
performance | .410101 .2730644 1.50 0.133 -.1250954 .9452975
_cons | -1.549452 1.661591 -0.93 0.351 -4.806111 1.707207
-------------+----------------------------------------------------------------
inflate |
age | -.2276064 .0454371 -5.01 0.000 -.3166615 -.1385512
performance | -2.223691 .4688857 -4.74 0.000 -3.14269 -1.304692
_cons | 16.52414 2.744172 6.02 0.000 11.14567 21.90262
-------------+----------------------------------------------------------------
/lnalpha | .3715287 .2249229 1.65 0.099 -.0693122 .8123696
-------------+----------------------------------------------------------------
alpha | 1.449949 .3261269 .9330353 2.253241
------------------------------------------------------------------------------

The top coefficients tell us that age has a significant and positive effect
on number of international matches played, given that a player has played
at least one such match. In contrast, the performance variable’s similar
effect is insignificant. The bottom coefficients say that both independent
variables have negative and significant effects on not having played an
international match –​or positive effects on having played one or more such
matches. That is, both independent variables affect whether a player is “an
international” or not. More refined interpretations of effects may also
be done for the zero-​inflated count data regression model; see Long and
Freese (2014). The standard reference for count data regression models is
Cameron and Trivedi (2013).2

11.4 Related technique 3: Panel data regression and time-series regression
(longitudinal data)

Panel data regression. We have analyzed soccer players’ earnings repeatedly
in this book. Imagine that for five consecutive seasons we collect data on
these players’ earnings, performance, number of international matches etc.
Then we would have five observations for each player, provided he played all
seasons. For those playing four seasons, we would have four observations and
so on. That is, we have repeated or multiple observations for each unit (i.e.

a.indd 180 27-Aug-19 15:05:00


181

Related techniques 181


player) in the data over time. Panel data is the name for such longitudinal
cross-​sectional data.
In the cross-​sectional regression of players’ earnings on their performance
(see Stata-output 2.8 and Figure 2.5), we saw that players with better per-
formance earned more on average than worse-​performing players. But since
both the dependent and the independent variable referred to the same season,
we could not make the more theoretically intriguing claim that performance
affects future earnings. With panel data this latter claim may be addressed
easily –​we simply regress the earnings in one year (t) on the performance in a
forgoing year (t –​1). In this way, panel data regression solves the simultaneity
problem mentioned in passing in note 3 in Chapter 9.
Solving simultaneity is but one cheer for panel data; a second is the possi-
bility of getting closer to revealing causal effects among variables that might
change values from year to year.3 For such variables we may examine how the
change in x1 from year t to t + 1 is associated with the change in y from year t
to t + 1 or from t + 1 to t + 2. This is called fixed-​effects regression analysis,
which might give credence to a causal interpretation of x1.4 Given a solid
grip on the cross-​sectional regressions in this book, however, the adoption
of the similar panel data regression models is straightforward in the main.
Wooldridge (2010) has more to offer on the subject.
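A minimal Stata sketch of such a fixed-effects setup might look as follows; the variable names (player_id, season, earnings, performance) are hypothetical stand-ins for a panel version of the football data:

xtset player_id season              // declare the panel structure (unit and time variable)
xtreg earnings L.performance, fe    // earnings in year t on performance in year t - 1, with player fixed effects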
Time-​series regression. With panel data we follow lots of units, typically people
or firms, over some period of time. For time-​series data, in contrast, we follow
one y-​unit –​the stock price of Apple, the U.S. GDP, the unemployment rate
in Great Britain or any other y-​unit that fluctuates –​through time (by year,
month, week, day, second, microsecond etc.). Also, we collect data on several
x-​variables following the same time registration scheme. Since this kind of
data also opens up for how changes in x1 affect future changes in y (akin to
panel data), we may come closer to pinpointing causal effects. Also, time in
itself plays a vital role as an independent variable in time-​series regressions.
From a technical or mathematical point of view, however, time-​series regres-
sion is more complex than cross-​sectional regression and panel data regres-
sion. Stock and Watson (2007) is the standard reference in this regard.
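As a taste of the syntax, a hypothetical sketch for yearly time-series data (gdp and oil_price are made-up variable names):

tsset year                          // declare the time-series structure
regress D.gdp L.D.oil_price         // this year's change in y regressed on last year's change in x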

11.5 Related technique 4: Multi-​level regression models


Say y is the grade point average (GPA) of pupils in secondary schools, and that
the x-​variables are other characteristics of the pupils: year of birth, month of
birth, gender, IQ, BMI, GPA the preceding year etc. We call these x-​variables
level I variables. If we were to apply a regression on these data, the plain van-
illa regressions in Chapters 2 and 3 in this book would suffice nicely. But we
could in this case also have access to data on x-​variables that vary between
schools: percentage of fully qualified teachers, average numbers of teachers
per class etc. These are level II variables. Finally, we might even have variables
varying between school districts: the affluence of the residents, the type of
political leadership etc. These are level III variables. When the independent
variables affecting y refer to different levels of aggregation, the regression
models covered in this book are not ideal. That is, to fully accommodate the
nested structure of the data (i.e. pupils → schools → school districts) and the
interactions between these nested levels, one should apply multi-​level regres-
sion models. Hox and Wijngaards-​de Meij (2015) is a nice introduction to
these models.
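In Stata, such models are typically estimated with the mixed command. The sketch below is hypothetical; gpa and iq are level I variables, teacher_share is a level II variable, and the nesting runs pupils within schools within districts:

* Random intercepts for districts and for schools within districts
mixed gpa iq gender teacher_share || district: || school: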

11.6 Related technique 5: Matching


This book has offered three methods if the specific aim of one’s research is to
find the causal effect of x1 on y when dealing with lots of units such as people
or firms: (1) to do IV regression (Chapter 9), (2) to do an RCT/​experiment
(section 10.2) or (3) to apply panel data regression/​fixed-​effects (section 11.4).
I’ll close with a final technique on this matter.
Matching techniques. In Chapter 3 we compared fitness center members and
non-​members in terms of hours of weekly exercise, controlling for some
independent variables. But we also imagined being able to control for every
variable affecting hours of exercise, i.e. making members and non-​members
similar in all respects save for the fitness center membership dummy. If we
were to succeed in doing this (a tall order indeed!), a causal interpretation
of the membership dummy would be warranted, akin to how randomization
in an RCT justifies a causal (treatment) effect. A third way to make two or
more groups “fully comparable” with regard to some y is to apply (propen-
sity score) matching. This involves using variables other than y in the data to
artificially create a treatment group and a placebo group, and then comparing
these two groups with respect to y.5 In practice, however, matching is by no
means simple to do, and it is not a silver bullet for obtaining a causal effect.
Gangl (2015) and Rosenbaum (2017) are both fine introductions to the topic.
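A minimal Stata sketch of propensity score matching, using hypothetical variable names inspired by the exercise example in Chapter 3 (exercise_hours, member, age, female, income):

* Average treatment effect of fitness center membership, matched on the propensity score
teffects psmatch (exercise_hours) (member age female income)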

11.7 Other, related techniques and further reading


Once you are familiar with the “extended” regression models covered ever
so briefly in this chapter, the what’s next question comes up. Three other
regression-inspired techniques are event history analysis/survival analysis
(Blossfeld and Blossfeld, 2015), regression discontinuity designs (Lee and
Lemieux, 2015) and difference-​in-​difference designs (Angrist and Pischke,
2009). Wooldridge (2010) still covers just about everything you ever need to
know about these regression-​inspired estimation techniques. Enjoy!

Notes
1 The significant Alpha-​test at the bottom of the output suggests that the NB model
is the better choice than the Poisson model. More technically, there is evidence of
so-​called overdispersion (cf. Long and Freese, 2014).



2 In some (survey) settings the y-​count cannot take the value of zero, as in answering
the question of how many times one has visited the site one presently is visiting, i.e.
the present visit must at least be the first. So-​called zero-​truncated count models
have been developed for data for this purpose; see Cameron and Trivedi (2013) and
Long and Freese (2014).
3 Or, say, from day to day or from week to week, dependent on how often the units of
interest are measured.
4 The fixed-​effects regression controls for all time-​constant independent variables due
to its change-​over-​time design, whether they are included in the regression model or
not. For this reason, fixed effects might come closer to justifying causal interpret-
ations. That said, fixed-​effects regression is no general panacea in this regard (see
Morgan and Winship, 2015).
5 In this way, such a matching procedure could in principle have been carried out for the
exercise data analyzed in Chapter 3.


References

Agresti, A. and Finlay, B. (2014). Statistical Methods for the Social Sciences. Fourth
Edition. Harlow, England: Pearson Education Limited.
Ai, C. and Norton, E. C. (2003). Interaction terms in logit and probit models.
Economics Letters, 80, 123–​129.
Allison, P. D. (1999). Multiple Regression. A Primer. Thousand Oaks, CA: Pine
Forge Press.
Angrist, J. D. and Pischke, J.-​S. (2009). Mostly Harmless Econometrics. An Empiricist’s
Guide. Princeton, NJ: Princeton University Press.
Baron, R. M. and Kenny, D. A. (1986). The moderator-​mediator variable distinction
in social psychological research: Conceptual, strategic, and statistical consider-
ations. Journal of Personality and Social Psychology, 51, 1173–​1182.
Berk, R. A. (2004). Regression Analysis. A Constructive Critique. Thousand Oaks,
CA: Sage Publications, Inc.
Berry, W. D. (1993). Understanding Regression Assumptions. Newbury Park, CA: Sage
Publications, Inc.
Best, H. and Wolf, C. (2015). Logistic regression. In Best, H. and C. Wolf: The Sage
Handbook of Regression Analysis and Causal Inference, pp. 153–​171. London: Sage
Publications, Inc.
Blossfeld, H.-P. and Blossfeld, G. J. (2015). Event history analysis. In Best, H. and C.
Wolf: The Sage Handbook of Regression Analysis and Causal Inference, pp. 359–​
386. London: Sage Publications, Inc.
Bollen, K. A. (2012). Instrumental variables in sociology and the social sciences.
Annual Review of Sociology, 38, 37–​72.
Brechot, M., Nüesch, S. and Franck, E. (2017). Does sports activity improve health?
Representative evidence using local density of sports facilities as an instrument.
Applied Economics, 49, 4871–​4884.
Breen, R., Karlson, K. B. and Holm, A. (2013). Total, direct, and indirect effects in
logit and probit models. Sociological Methods & Research, 42, 164–​191.
Cameron, A. C. and Trivedi, P. K. (2013). Regression Analysis of Count Data. Second
Edition. New York: Cambridge University Press.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences.
London: Routledge.
Deaton, A. and Cartwright, N. (2018). Understanding and misunderstanding
randomized controlled trials. Social Science & Medicine, 210, 2–​21.
DeMaris, A. (2004). Regression with Social Data. Modeling Continuous and Limited
Response Variables. Hoboken, NJ: John Wiley & Sons.

Dougherty, C. (2011). Introduction to Econometrics. Fourth Edition. Oxford: Oxford
University Press.
Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods.
Thousand Oaks, CA: Sage Publications, Inc.
Freedman, D., Pisani, R. and Purves, R. (1998). Statistics. Third Edition.
New York: W.W. Norton & Company.
Fullerton, A. S. (2009). A conceptual framework for ordered logistic regression models.
Sociological Methods & Research, 38, 306–​347.
Gangl, M. (2015). Matching estimators for treatment effects. In Best, H. and C.
Wolf: The Sage Handbook of Regression Analysis and Causal Inference, pp. 251–​
276. London: Sage Publications, Inc.
Greene, W. H. and Hensher, D. A. (2010). Modeling Ordered Choices. A Primer.
New York: Cambridge University Press.
Gujarati, D. (2015). Econometrics by Example. Second Edition. London: Palgrave
Macmillan.
Halvorsen, R. and Palmquist, R. (1980). The interpretation of dummy variables in
semilogarithmic equations. American Economic Review, 70, 474–​475.
Hernán, M. A. and Robins, J. M. (2006). Instruments for causal inference. An
epidemiologist’s dream? Epidemiology, 17, 360–​372.
Hoem, J. M. (2008). The reporting of statistical significance in scientific journals.
Demographic Research, 18, 437–​440.
Hosmer, D. W., Lemeshow, S. and Sturdivant, R. X. (2013). Applied Logistic Regression.
Third Edition. Hoboken, NJ: John Wiley & Sons.
Hox, J. and Wijngaards-​de Meij, L. (2015). The multilevel regression model. In Best,
H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal Inference,
pp. 11–​131. London: Sage Publications, Inc.
Imbens, G. W. (2014). Instrumental variables: An econometrician’s perspective.
Statistical Science, 29, 323–​358.
Kennedy, P. (2002). Sinning in the basement: What are the rules? The ten commandments
of applied econometrics. Journal of Economic Surveys, 16, 569–​589.
King, G. (2006). Publication, publication. Political Science and Politics, 39, 119–​125.
King, G. and Roberts, M. E. (2015). How robust standard errors expose methodological
problems they do not fix, and what to do about it. Political Analysis, 13, 159–​179.
Koenker, R. (2005). Quantile Regression. New York: Cambridge University Press.
Lee, D. S. and Lemieux, T. (2015). Regression discontinuity designs in social sciences.
In Best, H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal
Inference, pp. 301–​326. London: Sage Publications, Inc.
Lohmann, H. (2015). Non-​linear and non-​additive effects in linear regression. In Best,
H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal Inference,
pp. 11–​131. London: Sage Publications, Inc.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables.
Thousand Oaks, CA: Sage Publications, Inc.
Long, J. S. (2015). Regression models for nominal and ordinal outcomes. In Best, H.
and C. Wolf: The Sage Handbook of Regression Analysis and Causal Inference, pp.
173–​203. London: Sage Publications, Inc.
Long, J. S. and Freese, J. (2014). Regression Models for Categorical Dependent Variables
Using Stata. Third Edition. College Station, TX: Stata Press.
Lumley, T. (2010). Complex Surveys. A Guide to Analysis Using R. Hoboken, NJ: John
Wiley & Sons.

Martens, E. P., Pestman, W. R., de Boer, A., Belitser, S. V. and Klungel, O. H. (2006).
Instrumental variables. Applications and limitations. Epidemiology, 17, 260–​267.
McKelvey, R. D. and Zavoina, W. (1975). A statistical model for the analysis of ordinal
level dependent variables. Journal of Mathematical Sociology, 4, 103–​120.
Menard, S. (1995). Applied Logistic Regression Analysis. Thousand Oaks, CA: Sage
Publications, Inc.
Meuleman, B., Loosveldt, G. and Emonds, V. (2015). Regression analysis: Assumptions
and diagnostics. In Best, H. and C. Wolf: The Sage Handbook of Regression Analysis
and Causal Inference, pp. 83–​110. London: Sage Publications, Inc.
Mitchell, M. N. (2012). Interpreting and Visualizing Regression Models Using Stata.
College Station, TX: Stata Press.
Morgan, S. L. and Winship, C. (2015). Counterfactuals and Causal Inference. Second
Edition. New York: Cambridge University Press.
Muller, C., Winship, C. and Morgan, S. L. (2015). Instrumental variables regression.
In Best, H. and C. Wolf: The Sage Handbook of Regression Analysis and Causal
Inference, pp. 277–​299. London: Sage Publications, Inc.
Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments.
Journal of Economic Perspectives, 20, 111–​132.
Nagler, J. (1994). Scobit: An alternative estimator to logit and probit. American Journal
of Political Science, 38, 230–​255.
Pearl, J. and Mackenzie, D. (2018). The Book of Why. The New Science of Cause and
Effect. New York: Basic Books.
Pedhazur, E. J. (1997). Multiple Regression in Behavioral Research. Explanation and
Prediction. Third Edition. Orlando, FL: Harcourt Brace College Publishers.
Pieters, R. (2017). Meaningful mediation analysis: Plausible causal inference and
informative communication. Journal of Consumer Research, 44, 692–​716.
Rosenbaum, P. R. (2017). Observation & Experiment. An Introduction to Causal
Inference. Cambridge, MA: Harvard University Press.
Sovey, A. J. and Green, D. P. (2010). Instrumental variables estimation in political
science: A readers’ guide. American Journal of Political Science, 55, 188–​200.
Sterba, S. K. (2009). Alternative model-​based and design-​based frameworks for infer-
ence from samples to populations: From polarization to integration. Multivariate
Behavioral Research, 44, 711–​740.
Stigler, S. M. (2016). The Seven Pillars of Statistical Wisdom. Cambridge, MA: Harvard
University Press.
Stock, J. H. and Watson, M. W. (2007). Introduction to Econometrics. Second Edition.
Boston, MA: Pearson Education.
Studenmund, A. H. (2017). Using Econometrics. A Practical Guide. Seventh Edition.
Harlow, England: Pearson Education Limited.
Thrane, C. (2018). Do experts’ reviews affect the decision to see motion pictures in
movie theaters? An experimental approach. Applied Economics, 50, 3066–​3075.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Second Edition. New York: Cambridge University Press.
Wasserstein, R. L., Schirm, A. L. and Lazar, N. A. (2019). Moving to a world beyond
“p < 0.05”. American Statistician, 73(S1), 1–​19.
Wheelan, C. (2013). Naked Statistics. New York: W.W. Norton & Company.
Wickham, H. and Grolemund, G. (2016). R for Data Science. Sebastopol, CA: O’Reilly
Media, Inc.

Williams, R. (2006). Generalized ordered logit/​partial proportional odds models for
ordinal dependent variables. The Stata Journal, 6, 58–​82.
Wolf, C. and Best, H. (2015). Linear regression. In Best, H. and C. Wolf: The Sage
Handbook of Regression Analysis and Causal Inference, pp. 55–​81. London: Sage
Publications, Inc.
Wooldridge, J. M. (2000). Econometric Analysis. A Modern Approach. USA: South-​
Western Publishing.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
Second Edition. Cambridge, MA: MIT Press.


Index

academic regression-based studies: causal analysis 168–71; description 164–6; inferences 166–8; practical advice 173–4; prediction 163–4; structure 171–3
additive model 102–3
additivity 82, 101–3
Agresti, A. 10, 79
all-else-equal 47, 64, 81, 84–6
Allison, P. D. 31–2, 61, 81, 85, 93
alternative hypothesis (H1) 74–5, 96
Angrist, J. D. 156, 159
assumptions 81, 92–3, 96; instrumental variable (IV) regression 153–6; logistic (logit) regression 133; testable (for a random sample) 87–92; testable (not requiring random sample) 82–7, 101–12; untestable 81–2

bad control variable 116
Baron, R. M. 116
Berk, R. A. 31–2, 61, 79, 85, 166–7
Berry, W. D. 93
Best, H. 32, 133
Bollen, K. A. 159
Brant test 141–2, 144
Brechot, M. 156
Breen, R. 133

Cameron, A. C. 180
causal effect 168–71; see also instrumental variable (IV) regression; randomized controlled trial (RCT)
ceteris paribus see all-else-equal
coefficient see regression coefficient
collinearity see perfect multicollinearity
commands 12–14; see also R program
confidence interval 72–4, 96
confounder 44, 65
constant 19, 21, 64; Ordinary Least Squares (OLS) 27–30
correctly specified regression model 93
correlated residuals 91–2
count data regression 177–80
covariate see independent variable
criterion variable see dependent variable
cross-sectional data 31

data 6, 65, 166–8
data types 31
dependent variable (y) 6, 65
description 164–6
difference-in-difference designs 182
do-file 14–15, 17
Dougherty, C. 32
dummy variable 20–2, 50–4, 65, 111–12, 120–1, 147; instrumental variable (IV) regression 156–7; linear probability model (LPM) 121–2; see also logistic (logit) regression
dummy variable sets 107–8

endogenous 119
error term (e) 30, 65, 87
event history analysis 182
excess zeros 179–80
exercises 62–4, 94–6, 134–5, 148–9
experimental data 31
explanatory power (R2) 56–9, 66
explanatory variable see independent variable

Finlay, B. 10, 79
Freedman, D. 79
Freese, J. 133, 136, 148, 180
Fullerton, A. S. 148

Gangl, M. 182
Green, D. P. 159
Greene, W. H. 148
Gujarati, D. 32

Halvorsen, R. 111
Hensher, D. A. 148
Hernán, M. A. 159
heteroscedasticity 88–9
homoscedastic residuals 87–9
Hosmer, D. W. 133
Hox, J. 182

Imbens, G. W. 159
independent variable (x) 6, 65; assumptions 81–2; see also dummy variable
index 25–6, 65
inferences 166–8
influential outliers 86
instrument exogeneity 154
instrument relevance 153–4
instrumental variable (IV) regression 150–3, 158–9; assumptions 153–6; dummy variable 156–7; instrument exogeneity 154; instrument relevance 153–4
interaction models 103–5, 127–8
interaction variable 102, 144
intercept see constant

Kenny, D. A. 116

Likert scale 136
linear probability model (LPM) 121–2, 125–6
linear regression 11–18, 31–2; dummy variable 20–2; error term (e) 30; female students' exercise hours 25–7; footballer pay/performance 22–5; house size/price 11–22; ordinal variables 146–7; Ordinary Least Squares (OLS) 27–30; questionnaires 25–7; slope and constant 19; see also assumptions
linearity assumption 82–3, 105–12
logarithms 108–12
logistic (logit) regression 121; assumptions 133; interaction models 127–8; interpretation 124–7; linear probability model (LPM) 125–6; marginal effects 124–5; mediation models 133; model fit 129–30; multinomial 130–2; non-linearity 128–9; predicted probabilities 126–7; probit model 133; sequential choices 133; standard model 123–4; see also ordinal (logit) regression model
Long, J. S. 133, 136, 148, 180
longitudinal data 31, 180–1
lurking variables 44

Mackenzie, D. 61, 159
marginal effects 124–5, 140–1
Martens, E. P. 159
matching 182
mean confidence interval 72–3
measurement levels 8–10
mechanism see mediation models
median regression 176–7
mediation models 113–16, 133
Meuleman, B. 86
model fit 129–30
Morgan, S. L. 159
motivations see purposes
Muller, C. 158
multi-level regression models 181–2
multicollinearity see perfect multicollinearity
multinomial logistic regression 130–2, 144–5
multiple ordinal (logit) regression model see ordinal (logit) regression model
multiple regression 43–5, 54–6, 61, 65; explanatory power (R2) 56–9; independent dummy variables 50–4; motivations 50; third variable 46–9

negative binomial model 178–80
non-additivity 101–5
non-linearity 105–12, 116, 128–9, 142–4
non-representative samples/populations 78
normal distribution 71–2, 93, 96
normally distributed residuals 90–1
null hypothesis (H0) 74–5, 78, 97

observational data 31
observations see units
omitted variable bias 82
one-sided significance test 78–9
ordinal (logit) regression model 137–8, 147–8; alternative strategies 144–6; interaction effects 144; interpretation 139–41; marginal effects 140–1; non-linearity 142–4; predicted probabilities 139–40; proportional odds-assumption 141–2, 146; sequential choices 146
ordinal variables 136–7, 146–7
Ordinary Least Squares (OLS) 27–30
outcome variable see dependent variable
outliers see influential outliers

Palmquist, R. 111
panel data regression 180–1
path model 113–16
Pearl, J. 61, 159
Pedhazur, E. J. 164
perfect multicollinearity 83–6
Pieters, R. 116
Pischke, J.-S. 156, 159
point-and-click routine: SPSS 38–42; Stata 34–8
Poisson model 178
population 69–70, 79, 97
predicted probabilities 126–7, 139–40
prediction 163–4
predictor variable see independent variable
probability see linear probability model (LPM)
probit model 133, 147
proportional odds-assumption 141–2, 146
purposes 163; causal analysis 168–71; description 164–6; inferences 166–8; prediction 163–4

quadratic regression models 105–7
quantile regression 176–7
questionnaires 25–7

R program 15–18, 44–6, 52–3
R2 see explanatory power
random sampling 70–1, 97
randomized controlled trial (RCT) 168
region of acceptance 75
regressand see dependent variable
regression coefficient 13, 19; confidence interval 73–4; Ordinary Least Squares (OLS) 27–30; significance testing 74–9
regression discontinuity designs 182
regression purposes see purposes
regression-based studies see academic regression-based studies
regressor see independent variable
related techniques 176, 182; count data regression 177–80; matching 182; median regression 176–7; multi-level regression models 181–2; panel data regression 180–1; quantile regression 176–7; time-series regression 181
relevant control variables 47–8, 81–2
repeated sampling 71–2
research questions 7, 46–9
residuals 65, 87–92
residuals versus predicted values 88
respondents 6, 66
response variable see dependent variable
Robins, J. M. 159
Rosenbaum, P. R. 159, 182

samples 69–70, 79, 97; random sampling 70–1; repeated sampling 71–2
sequential choices 133, 146
significance levels 76, 78
significance probability (p) 76–7
significance testing 74–9, 97
slope see regression coefficient
Sovey, A. J. 159
SPSS (point-and-click routine) 38–42
standard deviation 12
standard error (SE) 73, 89, 97
Stata 6–7, 11–18, 54, 78; point-and-click routine 34–8
stepwise regression 174
Sterba, S. K. 78
Stock, J. H. 158, 181
studies see academic regression-based studies
subjects 6, 66
survival analysis 182

t-value 75–6, 97
time-series regression 181
Train, K. E. 133
treatment variable see independent variable
Trivedi, P. K. 180

uncorrelated residuals 91–2
units 6, 12, 65

variables 5–6, 8–10, 66; categorical 9; numerical 9; ordered 9–10; see also dependent variable (y); independent variable (x)

Wasserstein, R. L. 79
Watson, M. W. 158, 181
Wheelan, C. 79
Wijngaards-de Meij, L. 182
Winship, C. 159
Wolf, C. 32, 133
Wooldridge, J. M. 28, 31–2, 61, 79, 86, 93, 181–2

x see independent variable

y see dependent variable

zero conditional mean 93