Dummy Variables in Regression

What is a Dummy Variable?


A dummy variable (an indicator variable) is a numeric variable that
represents categorical data, such as gender, race, political affiliation,
etc.
Technically, dummy variables are dichotomous, quantitative
 variables. Their range of values is small; they can take on only two
quantitative values.
As a practical matter, regression results are easiest to interpret when
dummy variables are limited to two specific values, 1 or 0.
Typically, 1 represents the presence of a qualitative attribute, and 0
represents the absence.
How Many Dummy Variables?

• The number of dummy variables required to represent a particular categorical
variable depends on the number of values that the categorical variable can
assume. To represent a categorical variable that can assume k different values, a
researcher would need to define k - 1 dummy variables.
• For example, suppose we are interested in political affiliation, a categorical
variable that might assume three values - Republican, Democrat, or
Independent. We could represent political affiliation with two dummy variables:
• X1 = 1, if Republican; X1 = 0, otherwise.
• X2 = 1, if Democrat; X2 = 0, otherwise.
• In this example, notice that we don't have to create a dummy variable to
represent the "Independent" category of political affiliation. If X1 equals zero and
X2 equals zero, we know the voter is neither Republican nor Democrat.
Therefore, the voter must be Independent.
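A minimal Python sketch of this k - 1 coding follows. The function name encode_affiliation is an illustrative assumption; the choice of Independent as the category without its own dummy mirrors the example above.

```python
# The three categories of the example; Independent gets no dummy of its own.
CATEGORIES = ["Republican", "Democrat", "Independent"]


def encode_affiliation(affiliation):
    """Return (X1, X2) for one voter using k - 1 = 2 dummy variables."""
    if affiliation not in CATEGORIES:
        raise ValueError(f"unknown affiliation: {affiliation!r}")
    x1 = 1 if affiliation == "Republican" else 0
    x2 = 1 if affiliation == "Democrat" else 0
    return x1, x2


for name in CATEGORIES:
    print(name, encode_affiliation(name))
```

An Independent voter is coded (0, 0), so no third variable is needed to identify that group.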
The Dummy Variable Trap

• When defining dummy variables, a common mistake is to
define too many variables.
• If a categorical variable can take on k values, it is tempting to
define k dummy variables. Resist this urge. Remember, you
only need k - 1 dummy variables.
• A kth dummy variable is redundant; it carries no new
information and, in a model that includes an intercept, it creates
perfect multicollinearity, so the coefficients cannot be estimated
uniquely.
• Using k dummy variables when only k - 1 dummy variables are
required is known as the dummy variable trap. Avoid this trap!
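The redundancy can be checked directly. The sketch below is an illustration under assumptions: the four voters are a made-up sample and k_dummies is a hypothetical helper. It codes all k = 3 categories and shows that the dummy columns always add up to the intercept column.

```python
CATEGORIES = ["Republican", "Democrat", "Independent"]


def k_dummies(affiliation):
    """Code one voter with k dummies, one per category (the trap)."""
    return [1 if affiliation == c else 0 for c in CATEGORIES]


# Made-up sample, for illustration only.
voters = ["Republican", "Democrat", "Independent", "Democrat"]

# Design-matrix rows: intercept column followed by all k dummies.
rows = [[1] + k_dummies(v) for v in voters]

# In every row the dummies sum to the intercept value, so the intercept
# column is an exact linear combination of the dummy columns.
redundant = all(sum(row[1:]) == row[0] for row in rows)
print(redundant)  # True
```

Dropping any one of the k dummies breaks this exact dependence, which is why k - 1 variables are enough.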
How to Interpret Dummy Variables

• Once a categorical variable has been recoded as a dummy variable, the
dummy variable can be used in regression analysis just like any other
quantitative variable.
• For example, suppose we wanted to assess the relationship between
household income and political affiliation (i.e., Republican, Democrat, or
Independent). The regression equation might be:
• Income = b0 + b1X1 + b2X2
• where b0, b1, and b2 are regression coefficients. X1 and X2 are dummy
variables defined as:
• X1 = 1, if Republican; X1 = 0, otherwise.
• X2 = 1, if Democrat; X2 = 0, otherwise.
• The value of the categorical variable that is not represented
explicitly by a dummy variable is called the reference group. In
this example, the reference group consists of Independent voters.
• In analysis, each dummy variable is compared with the reference
group.
• In this example, a positive regression coefficient means that
income is higher for the group identified by that dummy variable than
for the reference group; a negative regression coefficient means
that income is lower.
• If the regression coefficient is statistically significant, the income
discrepancy with the reference group is also statistically
significant.
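To make the comparison with the reference group concrete, here is a short Python sketch. The coefficient values are hypothetical, chosen only for illustration; they are not estimates from any data.

```python
# Hypothetical coefficients, for illustration only (not estimates).
b0, b1, b2 = 50000.0, 4000.0, -1500.0


def predicted_income(x1, x2):
    """Income = b0 + b1*X1 + b2*X2 for one combination of dummy values."""
    return b0 + b1 * x1 + b2 * x2


independent = predicted_income(0, 0)  # reference group: b0
republican = predicted_income(1, 0)   # b0 + b1
democrat = predicted_income(0, 1)     # b0 + b2

# Each coefficient equals the gap between its group and the reference group.
print(republican - independent)  # 4000.0
print(democrat - independent)    # -1500.0
```

The intercept b0 is the predicted income of the reference group, and each dummy coefficient is the difference from it.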
EXAMPLE OF DUMMY VARIABLE
• Consider the table below. It uses three variables to describe 10 students.
Two of the variables (Test score and IQ) are quantitative. One of the
variables (Gender) is categorical.
Student   Test score   IQ    Gender
1         93           125   Male
2         86           120   Female
3         96           115   Male
4         81           110   Female
5         92           105   Male
6         75           100   Female
7         84           95    Male
8         77           90    Female
9         73           85    Male
10        74           80    Female
• For this problem, we want to test the usefulness of IQ
and Gender as predictors of Test Score. To accomplish
this objective, we will:
• Recode the categorical variable (Gender) to be a
quantitative, dummy variable.
• Define a regression equation to express the
relationship between Test Score, IQ, and Gender.
• Conduct a standard regression analysis and interpret
the results.
Dummy Variable Recoding

• The first thing we need to do is to express gender as one or more dummy
variables. How many dummy variables will we need to fully capture all of the
information inherent in the categorical variable Gender? To answer that
question, we look at the number of values (k) Gender can assume. We will
need k - 1 dummy variables to represent Gender. Since Gender can assume
two values (male or female), we will only need one dummy variable to
represent Gender.
• Therefore, we can express the categorical variable Gender as a single dummy
variable (X1), like so:
• X1 = 1 for male students.
• X1 = 0 for non-male students.
• Now, we can replace Gender with X1 in our data table.
Student   Test score   IQ    X1
1         93           125   1
2         86           120   0
3         96           115   1
4         81           110   0
5         92           105   1
6         75           100   0
7         84           95    1
8         77           90    0
9         73           85    1
10        74           80    0
Note that X1 identifies male students explicitly. Non-male students are the reference
group. This was an arbitrary choice. The analysis works just as well if you use X1 to
identify female students and make non-female students the reference group.
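The recoding step can be sketched in a few lines of Python. The list names are illustrative assumptions; the Gender values are those of the data table.

```python
# Gender column of the data table, students 1 to 10.
genders = ["Male", "Female", "Male", "Female", "Male",
           "Female", "Male", "Female", "Male", "Female"]

# X1 = 1 for male students, 0 for everyone else (the reference group).
x1 = [1 if g == "Male" else 0 for g in genders]
print(x1)  # [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Reversing the coding only swaps which group serves as the reference.
x1_female = [1 - v for v in x1]
print(x1_female)  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
```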
The Regression Equation

• At this point, we conduct a routine regression analysis. No special
tweaks are required to handle the dummy variable. So, we begin by
specifying our regression equation. For this problem, the equation
is:
• ŷ = b0 + b1IQ + b2X1
• where ŷ is the predicted value of the Test Score, IQ is the IQ score,
X1 is the dummy variable representing Gender, and b0, b1, and b2 are
regression coefficients.
• Values for IQ and X1 are known inputs from the data table. The only
unknowns on the right side of the equation are the regression
coefficients, which we will estimate through least-squares
regression.
Regression Coefficients

For now, the key outputs of interest are the least-squares estimates for regression
coefficients. They allow us to fully specify our regression equation:
ŷ = 38.6 + 0.4 * IQ + 7 * X1

This is the only linear equation in IQ and X1 that satisfies the least-squares criterion:
it minimizes the sum of squared differences between the observed and predicted test
scores, so it fits the data from which it was created better than any other such equation.
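These estimates can be reproduced from the data table. The sketch below is one possible way to do it, not necessarily how the original analysis was run: it solves the normal equations for two predictors in closed form using sums of squares about the means. In practice a statistics package would normally be used; the closed form only keeps the sketch free of dependencies. The variable and helper names are illustrative assumptions.

```python
test_score = [93, 86, 96, 81, 92, 75, 84, 77, 73, 74]
iq = [125, 120, 115, 110, 105, 100, 95, 90, 85, 80]
x1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]


def mean(values):
    return sum(values) / len(values)


def centered_cross_product(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Sums of squares and cross products about the means.
s_ii = centered_cross_product(iq, iq)
s_xx = centered_cross_product(x1, x1)
s_ix = centered_cross_product(iq, x1)
s_iy = centered_cross_product(iq, test_score)
s_xy = centered_cross_product(x1, test_score)

# Solve the 2x2 normal equations for the two slopes, then back out b0.
det = s_ii * s_xx - s_ix ** 2
b1 = (s_xx * s_iy - s_ix * s_xy) / det
b2 = (s_ii * s_xy - s_ix * s_iy) / det
b0 = mean(test_score) - b1 * mean(iq) - b2 * mean(x1)

print(round(b0, 4), round(b1, 4), round(b2, 4))  # 38.6 0.4 7.0
```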
Significance of Regression Coefficients
• With multiple regression, there is more than one independent
variable; so it is natural to ask whether a particular independent
variable contributes significantly to the regression after effects of
other variables are taken into account. The answer to this question
can be found in the regression coefficients table:
• The regression coefficients table shows the following information for
each coefficient: its value, its standard error, a t-statistic, and the
significance of the t-statistic. In this example, the t-statistics for IQ
and gender are both statistically significant at the 0.05 level. This
means that IQ predicts test score beyond chance levels, even after
the effect of gender is taken into account. And gender predicts test
score beyond chance levels, even after the effect of IQ is taken into
account.
• The regression coefficient for gender provides a measure of
the difference between the group identified by the dummy
variable (males) and the group that serves as a reference (females).
• Here, the regression coefficient for gender is 7. This
suggests that, after effects of IQ are taken into account,
males are predicted to score 7 points higher on the test, on
average, than the reference group (females).
• And, because the regression coefficient for gender is
statistically significant, we interpret this difference as a real
effect that is unlikely to be due to chance alone.
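The significance claim can be checked against the data. The sketch below takes the reported coefficients as given and computes standard errors and t-statistics for IQ and gender under the usual least-squares assumptions. The helper names are illustrative, and 2.365 is the standard two-sided 0.05 critical value of the t distribution with 7 degrees of freedom (10 observations minus 3 coefficients).

```python
import math

test_score = [93, 86, 96, 81, 92, 75, 84, 77, 73, 74]
iq = [125, 120, 115, 110, 105, 100, 95, 90, 85, 80]
x1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
b0, b1, b2 = 38.6, 0.4, 7.0  # estimates reported in the text


def centered_cross_product(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Residual variance: sum of squared residuals over n - 3 degrees of freedom.
residuals = [y - (b0 + b1 * q + b2 * d)
             for y, q, d in zip(test_score, iq, x1)]
sse = sum(r ** 2 for r in residuals)
s2 = sse / (len(test_score) - 3)

# Variances of the slopes from the inverse of the centered cross-product matrix.
s_ii = centered_cross_product(iq, iq)
s_xx = centered_cross_product(x1, x1)
s_ix = centered_cross_product(iq, x1)
det = s_ii * s_xx - s_ix ** 2
se_b1 = math.sqrt(s2 * s_xx / det)
se_b2 = math.sqrt(s2 * s_ii / det)

t_b1, t_b2 = b1 / se_b1, b2 / se_b2
T_CRITICAL = 2.365  # t distribution, 7 df, two-sided 0.05 level
print(abs(t_b1) > T_CRITICAL, abs(t_b2) > T_CRITICAL)  # True True
```

Both t-statistics exceed the critical value, which agrees with the statement that IQ and gender are each significant at the 0.05 level.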
INTERCEPT DUMMY VARIABLE

• In general, start from the model:
• Y = b0 + b1IQ + µ    (1)
• Introduce the gender dummy variable X1:
• Y = b0 + b1IQ + b2X1 + µ    (2)
• where b0, b1, and b2 are regression coefficients, IQ and X1 are the variables of the model,
Y is the test score, and µ is the error term. X1 is the categorical variable Gender
represented as a single dummy variable:
• X1 = 1 for male students.
• X1 = 0 for non-male students.
• To interpret X1, consider its two possible values and how each affects
the specification of equation (2) above. For X1 = 0 we have:
• Y = b0 + b1IQ + b2(0) + µ    (3)
• Y = b0 + b1IQ + µ    (4)
• Equation (4) is the same as the original model without the dummy variable X1.
• If X1 = 1, we have:
• Y = b0 + b1IQ + b2(1) + µ    (5)
• Y = b0 + b1IQ + b2 + µ    (6)
• Y = (b0 + b2) + b1IQ + µ    (7)
• The constant, or intercept, of equation (7) is now (b0 + b2) rather than
b0. Including the dummy variable therefore changes the intercept, shifting
the function (and therefore the regression line) up or down.
• Relating this to the estimate in the example above:
• ŷ = 38.6 + 0.4 * IQ + 7 * X1
• This suggests that, after effects of IQ are taken into account, the intercept
for females is 38.6 (b0), while the intercept for males is 7 points higher, or
38.6 + 7 = 45.6 (b0 + b2). At any given IQ, males are predicted to score 7
points higher than females.
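A short Python sketch of the intercept shift, using the fitted equation from the example. The function name predict and the three IQ values are illustrative assumptions.

```python
B0, B1, B2 = 38.6, 0.4, 7.0  # coefficients of the fitted equation


def predict(iq, x1):
    """Predicted test score for a given IQ and gender dummy (1 = male)."""
    return B0 + B1 * iq + B2 * x1


for iq in (80, 100, 120):
    female, male = predict(iq, 0), predict(iq, 1)
    # The gap equals B2 at every IQ: the two lines are parallel and
    # differ only in their intercepts.
    print(iq, round(female, 1), round(male, 1), round(male - female, 1))
```

Setting IQ to 0 recovers the two intercepts, b0 for females and b0 + b2 for males.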
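The effect of a dummy variable on the constant of the regression line

[Figure: regression lines of Y against X. The line with intercept b0 shifts up to a parallel line when the dummy coefficient is positive and down when it is negative.]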
SLOPE DUMMY VARIABLE

• Consider model (1) above:
• Y = b0 + b1IQ + µ
• The slope coefficient (b1) is the marginal impact of IQ on the test score. It is
given as: b1 = ΔY / ΔIQ
• Suppose that we think that the last five of the 10 students (serial numbers 6 to 10)
took IQ-enhancing drugs. In order to test this, we need to
construct a dummy variable (D) that takes the following values:
• D = 0 for students who did not take the drug (serial numbers 1 to 5)
• D = 1 for students who took the drug (serial numbers 6 to 10)
• Because we assume that this dummy variable affects the slope
parameter, it must be included in the model in the following multiplicative way:
• Y = b0 + b1IQ + b2DIQ + µ    (8)
• The effect of the dummy variable can again be separated according to its two
possible values. If D = 0, we have:
• Y = b0 + b1IQ + b2(0)IQ + µ    (9)
• Y = b0 + b1IQ + µ    (10)
• Equation (10) is the same as the initial model.
• If D = 1:
• Y = b0 + b1IQ + b2(1)IQ + µ    (11)
• Y = b0 + b1IQ + b2IQ + µ
• Y = b0 + (b1 + b2)IQ + µ    (12)
• So, the marginal impact of IQ on test performance for students who did not
take the drug is b1, and the marginal impact of IQ on test performance for students
who took the drug is (b1 + b2).
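The multiplicative term in model (8) is simply a new regressor equal to D times IQ. The Python sketch below builds that column for the ten students, using the grouping assumed in the example (students 1 to 5 untreated, 6 to 10 treated). The helper name slope is an illustrative assumption.

```python
iq = [125, 120, 115, 110, 105, 100, 95, 90, 85, 80]

# Grouping assumed in the example: students 1-5 have D = 0, students 6-10 have D = 1.
d = [0] * 5 + [1] * 5

# Interaction regressor D*IQ: zero when D = 0, equal to IQ when D = 1.
d_iq = [di * q for di, q in zip(d, iq)]
print(d_iq)  # [0, 0, 0, 0, 0, 100, 95, 90, 85, 80]


def slope(b1, b2, d_value):
    """Marginal effect of IQ on Y in model (8): b1 + b2*D."""
    return b1 + b2 * d_value
```

Regressing Y on IQ and this D*IQ column would estimate b1 and b2, and slope(b1, b2, 1) then gives the IQ effect for the D = 1 group.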
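The effect of a dummy variable on the slope of the regression

[Figure: regression lines of Y against X with a common intercept b0. The line for D = 0 has slope b1; the line for D = 1 has slope b1 + b2, steeper when b2 > 0 and flatter when b2 < 0.]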
The combined effect of intercept and slope dummies
• Let us suppose that we have a dummy variable defined as:
• D = 0 for students who did not take the drug
• D = 1 for students who took the drug
• Given our model:
• Y = b0 + b1IQ + µ
• Then, using the dummy variable to examine its effects on both the constant and the
slope, we have:
• Y = b0 + b1IQ + b2D + b3D(IQ) + µ
• If D = 0:
• Y = b0 + b1IQ + µ (as before)
• If D = 1:
• Y = b0 + b1IQ + b2 + b3IQ + µ = (b0 + b2) + (b1 + b3)IQ + µ
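The combined model can be sketched in Python as follows. The coefficient values are hypothetical and serve only to show the arithmetic; group_line and design_row are illustrative helper names.

```python
def design_row(iq, d):
    """Regressors for one student: constant, IQ, D, and the interaction D*IQ."""
    return [1, iq, d, d * iq]


def group_line(b0, b1, b2, b3, d):
    """Return (intercept, slope) of Y = b0 + b1*IQ + b2*D + b3*D*IQ for a given D."""
    return b0 + b2 * d, b1 + b3 * d


# Hypothetical coefficients, for illustration only (not estimates).
coeffs = (40.0, 0.5, 5.0, -0.25)

print(group_line(*coeffs, d=0))  # (40.0, 0.5)
print(group_line(*coeffs, d=1))  # (45.0, 0.25)
print(design_row(100, 1))        # [1, 100, 1, 100]
```

With D = 1 both the intercept (b0 + b2) and the slope (b1 + b3) change, so each group gets its own regression line.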
