Understanding Regression Equations: Interpreting Regression Tables
Robert Upton
Cambridge, Massachusetts USA
Contents
1 Introduction
4 Special Cases
4.1 Interaction Variables
4.1.1 Interaction Example 1: Finding Relevant Coefficients
4.1.2 Interaction Example 2: Interpreting a Table A
4.1.3 Interaction Example 3: Interpreting a Table B
4.2 Difference in Differences
6 Glossary
7 Recommended Resources
References
1 Introduction
Regression tables are an integral part of how economists share information about their
data and explain their findings. This guide is intended to help you understand the basics
of regression in order to interpret the results of regression analyses. Using explanation
and examples, we will go through the basic structure of regression equations and learn
how to glean relevant and useful information from regression tables. Someone new to regression will benefit most from reading the entire guide in order, while a more experienced reader can skip to the section of interest. Each section contains a set of key words that link to their definitions in the glossary, and each section begins with an explanation of the topic followed by examples.
2 What is Linear Regression?
2.1 Understanding a Regression Equation
• Regression Analysis
• Regression Equation
• Independent Variable
• Dependent Variable
• Regression Coefficient
Figure 1: This graph shows the relationship between height and weight. Notice how the
line follows the trend of the data.
In the above graph each dot represents a data point (in this case, each dot represents
a person), and the equation of the line is our regression. Our predicted height for any
given person is the point along the line corresponding to that person's weight. Although the outcome variable is what we are interested in observing, we usually are not concerned with finding predicted values. Rather, we are interested in how the independent variables affect the outcome variable, which is what the coefficients tell us. In the following equation, we are interested in the height of a person and think it may depend on the person's age, parental income, and cigarettes smoked in a year:
Height = β0 + β1 ∗ Age + β2 ∗ Parent Inc + β3 ∗ Cigarettes + ε
Now let's make a chart of the βs. To find the coefficients, we look at the constant (β0) and then at the numbers that multiply the independent variables to get

β0 = 20,  β1 = 1.9,  β2 = .0004,  β3 = −.001
So what do these numbers mean? Let’s look at β1 . If we look at the first equation
we see that β1 is the number that multiplies Age. This means that β1 shows the change
in height due to a one unit (in this case, one year) increase in age. Since the number is
1.9, we know that a 1 year increase in age results in a 1.9 inch increase in height. What
about β3 ? We see that the corresponding number is -.001 and so we know that a 1 unit
increase in the number of cigarettes smoked a year leads to a .001 inch decrease in height.
If we wanted to predict the height of someone who is 20 years old, whose parents earn $20,000, and who smokes 365 cigarettes a year, we just plug those numbers into the regression equation to get

Height = 20 + 1.9 ∗ 20 + .0004 ∗ 20000 − .001 ∗ 365 = 65.635 inches
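As a quick sanity check, the prediction above can be reproduced in a few lines of Python. This is only an illustrative sketch: the coefficient values are the ones from the chart above, and the function name is made up for this example.

```python
# Coefficients from the chart above: intercept, age, parental income, cigarettes/year
b0, b1, b2, b3 = 20, 1.9, 0.0004, -0.001

def predicted_height(age, parent_inc, cigarettes):
    """Plug the independent variables into the regression equation."""
    return b0 + b1 * age + b2 * parent_inc + b3 * cigarettes

# A 20-year-old whose parents earn $20,000 and who smokes 365 cigarettes a year
print(round(predicted_height(20, 20000, 365), 3))  # 65.635
```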
Often, the reason we use regression equations is to tease out the results of an experiment. In order to use regression to test the effects of receiving the treatment in an experiment, we add what is called a dummy variable: a variable that equals 1 or 0 depending on whether or not a person can be described by some designated descriptor. In the case of an experiment, the variable equals 1 if the person is in the treatment group and 0 otherwise. Let's say we are trying to find the effect of giving a person a million dollars on that person's health. To test this, we give a random set of people a million dollars and give the rest nothing. To account for this in the regression, we create a variable called Treat and set it equal to 1 if the person was in treatment (and therefore received a million dollars) and 0 if we gave them nothing. The variable Treat is the treatment dummy. We will also add variables for age, sex (also a dummy variable, with Sex = 1 if female and 0 if male), and number of siblings. The resulting equation is
Health = β0 + β1 ∗ Treat + β2 ∗ Age + β3 ∗ Sex + β4 ∗ Siblings + ε
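A hedged sketch of how such a dummy enters the estimation, using numpy. The data here are entirely made up for illustration, and the model is simplified to just an intercept plus the treatment dummy, so the fitted β1 is the difference between treated and control outcomes.

```python
import numpy as np

# Toy data: health score is exactly 50 for control and 57 for treated,
# so the fitted treatment coefficient should be 7.
treat = np.array([0, 0, 0, 1, 1, 1])
health = np.array([50.0, 50.0, 50.0, 57.0, 57.0, 57.0])

# Design matrix: a column of ones (for beta_0) and the Treat dummy (for beta_1)
X = np.column_stack([np.ones_like(treat, dtype=float), treat])
betas, *_ = np.linalg.lstsq(X, health, rcond=None)

print(betas)  # beta_0 = control mean (50), beta_1 = treatment effect (7)
```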
2.4 Control Variables
• Control Variable
Let's look at a regression that has the variable Treat, which indicates whether or not someone received treatment.
Suppose Sally received treatment and Bob did not, but the two are otherwise identical. Writing the regression equation for each gives HeightSally and HeightBob. We can subtract the two equations to get

HeightSally − HeightBob = (β0 + β1 ∗ TreatSally + β2 ∗ Parent IncSally + ε)
− (β0 + β1 ∗ TreatBob + β2 ∗ Parent IncBob + ε)     (1)

Since Sally and Bob are identical in all aspects except for treatment, we know that the incomes of Sally's and Bob's parents are the same, i.e. Parent IncBob = Parent IncSally, so when we subtract the two equations we get

HeightSally − HeightBob = β1 ∗ (TreatSally − TreatBob) = β1
Figure 2: This table shows the effect of weight, mileage and car type on price.
First notice that there are two columns. Each column represents a different regression
equation. The first column (column (1)) shows a regression where price is our dependent
variable, and our independent variables are weight and mileage. The second column
represents a regression equation that also includes car type as an independent variable.
At the top of each column is the dependent variable (P rice), and each row represents an
independent variable.
Focus only on column (1) for the example. The result we are trying to find is the
price of a car and so the price is the dependent variable. What are we using to predict
the price? We are going to use only mileage and weight because they are the variables
included in the regression equation that gives us the information in column (1). Car type is left blank in column (1) because this variable was not included in the regression equation corresponding to column (1).
Car type was excluded from this regression equation because, in this case, it is not part of what we are interested in. Finally, we need to know what the regression coefficients (βs) are. Reading off column (1), we have the equation

Price = β0 + β1 ∗ Weight + β2 ∗ Mileage + ε

with

β0 = 1946.1,  β1 = 1.747,  β2 = −49.51
Let's now interpret these coefficients. We know that β1 is the change in price due to a one unit increase in weight. This means that for every pound the weight increases, the price increases by $1.747. For β2, every one mpg increase in mileage leads to a $49.51 decrease in price.
We can translate between a regression table and a regression equation in the following way: first, find the outcome we are trying to predict (normally at the top of the column) and plug it in as the dependent variable; then find what we are using to make this prediction (usually the leftmost column); finally, look at each row to get the βs.
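The steps above can be sketched in a few lines of Python. The coefficients are the ones read off column (1); the example weight and mileage values are made up purely to show the translation from table to prediction.

```python
# Coefficients read off column (1): constant, weight, mileage
col1 = {"const": 1946.1, "weight": 1.747, "mileage": -49.51}

def predicted_price(weight, mileage):
    """Price = beta_0 + beta_1 * Weight + beta_2 * Mileage."""
    return col1["const"] + col1["weight"] * weight + col1["mileage"] * mileage

# A hypothetical 3,000-pound car that gets 20 mpg
print(round(predicted_price(3000, 20), 2))  # 6196.9
```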
Figure 3: The table shows how one's attitude toward affirmative action changes based on one's roommates.

Approve = β0 + β1 ∗ B + β2 ∗ Min + β3 ∗ I1 + β4 ∗ I2 + β5 ∗ I3 + β6 ∗ I4 + ε
In order to find the effect of having a black roommate on one's attitude toward Affirmative Action, we compare two students who are the same in every way except that one of them has a black roommate. Let's write the regression equation for person 1 and person 0. For person 1 the regression equation is

Approve1 = β0 + β1 ∗ B1 + β2 ∗ Min1 + β3 ∗ I11 + β4 ∗ I21 + β5 ∗ I31 + β6 ∗ I41 + ε

and for person 0 it is

Approve0 = β0 + β1 ∗ B0 + β2 ∗ Min0 + β3 ∗ I10 + β4 ∗ I20 + β5 ∗ I30 + β6 ∗ I40 + ε

To find the difference in approval between the two students, we subtract their equations to get

Approve1 − Approve0 = β1 ∗ (B1 − B0) + β2 ∗ (Min1 − Min0) + β3 ∗ (I11 − I10) + β4 ∗ (I21 − I20) + β5 ∗ (I31 − I30) + β6 ∗ (I41 − I40)
The important thing to remember is that we are assuming that the two people are
identical in all ways except for whether or not they have a black roommate, so the
variable values for all variables other than B are the same. In other words, this means
that M in1 = M in0 , I11 = I10 , I21 = I20 , I31 = I30 , and I41 = I40 . On the other hand,
since person 1 has a black roommate, B1 = 1 and since person 0 does not have a black
roommate, B0 = 0. This causes the above equation to become
Approve1 − Approve0 = β1 ∗ 1 − β1 ∗ 0 = β1
Looking at the original regression equation, we see that β1 multiplies B. This means
that β1 is the effect of having a black roommate. Returning to the regression table, we
look at the row for having a black roommate and see that it is .489. Since .489 > 0,
the table is saying that having a black roommate increases the likelihood that someone
approves of Affirmative Action.
Figure 4: Table shows the effect of each grandparent being eligible for a pension.

Weight/Height Score = β0 + β1 ∗ MM + β2 ∗ FM + β3 ∗ MF + β4 ∗ FF + controls + ε
Let's try to find the effect of one's mother's mother being eligible. In order to do this we need to compare two hypothetical people (person 1 and person 2) who are exactly the same except that one's mother's mother is eligible while the other's is not. The regression equations for person 1 and person 2 are

Weight/Height1 = β0 + β1 ∗ MM1 + β2 ∗ FM1 + β3 ∗ MF1 + β4 ∗ FF1 + controls + ε
Weight/Height2 = β0 + β1 ∗ MM2 + β2 ∗ FM2 + β3 ∗ MF2 + β4 ∗ FF2 + controls + ε

We are interested in the difference of their scores, so we subtract one equation from the other. We will shorten Weight/Height to W/H.
W/H1 − W/H2 = (β0 + β1 ∗ MM1 + β2 ∗ FM1 + β3 ∗ MF1 + β4 ∗ FF1 + controls + ε)
− (β0 + β1 ∗ MM2 + β2 ∗ FM2 + β3 ∗ MF2 + β4 ∗ FF2 + controls + ε)     (3)
Now we will plug in the values for each person. As stated before, the two people are the same in every way except that the mother's mother of person 1 is eligible for a pension, while the mother's mother of person 2 is not. This implies MM1 = 1 and MM2 = 0. The rest of the variables are the same, so FM1 = FM2, MF1 = MF2, and FF1 = FF2. Plugging these in, everything cancels except the MM terms, leaving

W/H1 − W/H2 = β1 ∗ (MM1 − MM2) = β1
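The cancellation argument can be checked mechanically. The sketch below uses arbitrary made-up β values (only the structure matters): it evaluates the regression equation for two people who differ only in MM and confirms the difference is β1.

```python
# Arbitrary illustrative coefficients: beta_0 .. beta_4
b = [3.0, 0.8, 0.2, -0.1, 0.4]

def w_h_score(mm, fm, mf, ff, controls=1.5):
    """Weight/Height Score = b0 + b1*MM + b2*FM + b3*MF + b4*FF + controls."""
    return b[0] + b[1]*mm + b[2]*fm + b[3]*mf + b[4]*ff + controls

# Person 1: mother's mother eligible; person 2: identical except MM = 0
diff = w_h_score(1, 1, 0, 1) - w_h_score(0, 1, 0, 1)
print(diff)  # equals b[1] (up to float rounding)
```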
4 Special Cases
4.1 Interaction Variables
• Interaction Variable
Consider a regression of height on a treatment dummy, a sex dummy, and their interaction:

Height = β0 + β1 ∗ Treat + β2 ∗ Sex + β3 ∗ Treat ∗ Sex + ε

The table below shows the predicted height for each combination of Treat and Sex.
                     Control (Treat = 0)    Treatment (Treat = 1)
Male (Sex = 0)       β0                     β0 + β1
Female (Sex = 1)     β0 + β2                β0 + β1 + β2 + β3
We derive the coefficients for a male and a female in treatment in the next sections as an example. In the table, we can see the height of someone who entered treatment as a male or a female. This means that instead of looking at the effect of treatment on the whole treatment group, we can focus on the subcategories male and female. If we are looking for the height of someone who is part of the treatment group and is male, we look at the top right box and get β0 + β1. Otherwise, we plug in the relevant values and see which coefficients are left in the equation. This type of variable will become essential when we look at regression tables.
The key takeaway is this: an interaction variable is just two variables that multiply each other. The point of interactions is to look at the effects of an independent variable on the dependent variable for different subgroups. This gives important information, as the treatment may affect different demographics differently.
4.1.1 Interaction Example 1: Finding Relevant Coefficients

Consider a male in the treatment group (call him Joe). Since Joe is male, SexJoe = 0, and since he is in treatment, TreatJoe = 1. Plugging these values into the regression equation gives

HeightJoe = β0 + β1 ∗ 1 + β2 ∗ 0 + β3 ∗ 1 ∗ 0 + ε = β0 + β1 + ε
This corresponds to the intersection of male and Treat = 1 in the table above. Now let's do the same derivation for a female. The equation for a female in treatment (call her Megan) is
HeightMegan = β0 + β1 ∗ TreatMegan + β2 ∗ SexMegan + β3 ∗ TreatMegan ∗ SexMegan + ε
Since Megan is female, SexMegan = 1, and since Megan is in the treatment group, TreatMegan = 1. When we plug these values into the regression equation we find that

HeightMegan = β0 + β1 ∗ 1 + β2 ∗ 1 + β3 ∗ 1 ∗ 1 + ε = β0 + β1 + β2 + β3 + ε
This corresponds to the intersection of female and treatment in the table above! Let's say we want to compare the effects of the treatment on a male and a female. To do this, we subtract Joe's height from Megan's height, and HeightMegan − HeightJoe gives us

HeightMegan − HeightJoe = β2 + β3
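The 2×2 table of coefficient combinations can be verified with a short sketch. The β values below are made up purely for illustration; the assertions simply check that each cell of the table is the combination of βs the derivation claims.

```python
# Illustrative coefficients: intercept, treatment, sex, interaction
b0, b1, b2, b3 = 60.0, 2.0, -4.0, 1.0

def height(treat, sex):
    """Height = b0 + b1*Treat + b2*Sex + b3*Treat*Sex (error term omitted)."""
    return b0 + b1*treat + b2*sex + b3*treat*sex

# The four cells of the table: each is the sum of the surviving betas
assert height(0, 0) == b0                    # male, control
assert height(1, 0) == b0 + b1               # male, treatment (Joe)
assert height(0, 1) == b0 + b2               # female, control
assert height(1, 1) == b0 + b1 + b2 + b3     # female, treatment (Megan)
print(height(1, 1) - height(1, 0))  # Megan - Joe = b2 + b3 = -3.0
```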
4.1.2 Interaction Example 2: Interpreting a Table A
Let’s look at a table with some interactions now. Tracking is a dummy variable for
whether or not the student was in a school where classes were split based on the students’
ability level.
Figure 5: This is a table linking academic tracking to a student’s total test score. The
table also accounts for what quartile the student started in based on test scores
First let's take a look at column (4) and try to find the effect of being tracked for a student in the second quartile. First, we define dummy variables for whether or not a student belongs to a certain quartile: let X1, X3, and X4 be dummies for whether the student was in the first, third, or fourth quartile. Track is a dummy variable for whether or not the student was in a tracking school. The equation for the regression is

Score = β0 + β1 ∗ Track + β2 ∗ X1 ∗ Track + β3 ∗ X3 ∗ Track + β4 ∗ X4 ∗ Track + ε
Since we are looking for the effect of tracking on a student in the second quartile, we need to compare one student to another, with the only difference between the two students being that one receives treatment and the other doesn't. This means that both students are in the second quartile. Let's say student A is in treatment and student B is not. We will shorten Track to T. Student A's regression is

ScoreA = β0 + β1 ∗ 1 + β2 ∗ 0 ∗ 1 + β3 ∗ 0 ∗ 1 + β4 ∗ 0 ∗ 1 + ε = β0 + β1 + ε

TA = 1 because student A is in treatment, and the remaining terms are 0 because student A is not in the first, third, or fourth quartile. On the other hand, student B's regression is
ScoreB = β0 + β1 ∗ 0 + β2 ∗ 0 ∗ 0 + β3 ∗ 0 ∗ 0 + β4 ∗ 0 ∗ 0 + ε = β0 + ε

where TB = 0 because student B is not in treatment, and the remaining terms are 0 because student B is not in the first, third, or fourth quartile.
The difference between the two regressions is

ScoreA − ScoreB = β1

This means that being tracked causes a .18 point increase in total score for a second quartile student, compared with an otherwise similar second quartile student who was not in a tracking school.
In order to find the effect of tracking on a student in the third quartile, we need to compare two students in the third quartile, with the only difference between the two being that one is tracked and the other isn't. Let's say student A is tracked and student B is not. The equation for A is

ScoreA = β0 + β1 + β3 + ε

We plugged in TA = 1 since student A receives treatment, and we set all X = 0 except X3 = 1 since the student is in the third quartile. For student B, TB = 0, and the X's are the same as for student A, so

ScoreB = β0 + ε

Finally, when we subtract B from A we find

ScoreA − ScoreB = β1 + β3
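Both subtractions can be checked with a sketch of the score equation. Only β1 = .18 is taken from the table; the other coefficients are made up for illustration.

```python
# Coefficients: intercept, Track, Track*X1, Track*X3, Track*X4
# b1 = 0.18 comes from the table; the rest are illustrative placeholders.
b0, b1, b2, b3, b4 = 1.0, 0.18, 0.05, -0.02, 0.07

def score(T, x1=0, x3=0, x4=0):
    """Score = b0 + b1*T + b2*X1*T + b3*X3*T + b4*X4*T (error term omitted)."""
    return b0 + b1*T + b2*x1*T + b3*x3*T + b4*x4*T

# Second-quartile students (all X = 0): tracked minus untracked leaves b1
print(score(1) - score(0))              # b1
# Third-quartile students (X3 = 1): tracked minus untracked leaves b1 + b3
print(score(1, x3=1) - score(0, x3=1))  # b1 + b3
```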
4.2 Difference in Differences
• Difference in Differences Estimator
The final type of regression we will discuss is difference in differences. This type of
regression is used to compare how one group’s outcome changes compared to how the
other group’s outcome changes over a certain period of time. For example, we could use
it to find the effects of an increase in the minimum wage on employment. In order to use difference in differences, it is best to use two groups that are essentially the same: they deal with the same issues and circumstances, so that when a change happens, say a change in the minimum wage, the only difference between the two groups is that change.
For this example, we will compare the groups of people on the two sides of the Pennsylvania-New Jersey border when New Jersey suddenly raised its minimum wage. Since the people are geographically very close and subject to the same economy, we assume that without the hike the New Jersey side of the border would have undergone the same trend as the Pennsylvania side. The only difference between the groups is the change in the minimum wage. Since this is the case, we can compare the economic trend PA undergoes to the trend NJ undergoes. By subtracting PA's trend from NJ's, we can theoretically find the minimum wage effect. The graphic below illustrates this.
Figure 6: The top solid line is the trend for PA, the bottom solid line is the trend for NJ, and the dotted bottom line is the trend NJ would have undergone if it had not increased its minimum wage.
The regression table for this is below and focuses on outcomes for restaurants near
the border.
To find the effect of the increase in the minimum wage on employment, we first find the change in PA employment over time: 21.17 − 23.33 = −2.16. This means PA experienced a 2.16 drop in the average number of fast food workers per restaurant, and it is important to note that PA did not raise its minimum wage. On the other hand, NJ experienced a change in employment of 21.03 − 20.44 = .59, so while employment in PA dropped, employment in NJ did not! If we believe that the only difference between the two states is the hike in the minimum wage, then we are led to believe that the increase in the minimum wage actually increased employment. It seems unlikely, but the difference in trends is NJ − PA, or .59 − (−2.16) = 2.75, and so it seems the increase in the minimum wage increased employment by about 2.75 workers per restaurant. Keep in mind that this result is barely statistically significant, and the paper is hotly contested because it relies on many assumptions, but it is a good example of difference in differences nevertheless.
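The difference-in-differences arithmetic can be written out directly. The four employment figures are the rounded averages quoted above, so the estimate comes out to 2.75 with these rounded inputs.

```python
# Average fast-food employment per restaurant, before and after NJ's wage hike
pa_before, pa_after = 23.33, 21.17
nj_before, nj_after = 20.44, 21.03

pa_change = pa_after - pa_before   # PA (no hike): its trend over the period
nj_change = nj_after - nj_before   # NJ (hike): its trend over the period

# Difference-in-differences: NJ's change relative to PA's trend
did = nj_change - pa_change
print(round(did, 2))  # 2.75
```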
To test whether a difference between groups is statistically significant, we use regression. In other words, we use regression to figure out if the
difference is large enough or precise enough to care about. When testing who lives longer
after receiving free health care we don’t want to compare those who are already sick to
those who are not sick. We want them to be basically equivalent on whatever metrics we
can gather. If there is a statistically significant difference between groups on a baseline
variable, we can use the baseline variable as a control variable in our final regression
(where the outcome is the dependent variable). Below is an example from a paper that
looks at the effects of a summer job on youth violence.
Looking at the table above, we see the statistics are nearly the same. An important part of this table is the final column: the p-value. The p-value is the probability of seeing a difference at least as large as the one observed if the two groups were in fact the same. For example, in the percent days absent row, the p-value of .99 means a difference of this size would easily arise by chance, so we have no evidence that the two groups differ on this variable.
6 Glossary
• Control Variable - A type of independent variable included in a regression. We are not interested in what effect these variables have on the dependent variable, but we think that they might affect the outcome. Therefore, we add control variables to make sure these factors are accounted for.
• Difference in Differences Estimator - The difference in differences estimator is the
difference in trend for the treatment group over the course of treatment minus the
trend in the control group over the course of treatment.
7 Recommended Resources
• A crash course in econometrics and regressions - Mastering 'Metrics: The Path from Cause to Effect by Joshua D. Angrist and Jörn-Steffen Pischke.
• Introduction to regression
• Interactions
• Difference in Differences
References
[1] Duflo. Grandmothers and granddaughters: Old age pension and intra-household allocation in South Africa. American Economic Review, 90(2):393–398, 2000.
[2] Duflo, Dupas, and Kremer. Peer effects and the impact of tracking: Evidence from a randomized evaluation in Kenya. American Economic Review, 101(5):1739–1774, 2011.
[3] Card and Krueger. Minimum wages and employment: A case study of the fast food industry in New Jersey and Pennsylvania. American Economic Review, 84(4):772–793, 1994.
[4] Heller. Summer jobs reduce violence among disadvantaged youth. Science, 346(6214):1219–1223, 2014.