
Dummy Variable Regression Models

-Yogita Yadav

In all the regression models discussed so far, the dependent and explanatory variables were quantitative in nature. But this may not always be the case. There are many occasions when the explanatory variables are qualitative in nature, for example gender, religion, colour, nationality, etc.
These qualitative variables are represented by Dummy Variables (also known as indicator/ categorical/ binary/ dichotomous variables).
A qualitative variable can be quantified with the help of an artificial variable taking the values 0 (zero) and 1 (one), where 0 indicates the absence of an attribute and 1 indicates the presence of that attribute.

Suppose we have a regression model where Y is dependent on one qualitative variable – ‘Gender’

Yi =B1 + B2 Di + ui (1.1)

Yi = Annual food expenditure (in $)

Di = 0, for males

Di = 1, for females

Here, Di is a dummy variable representing gender.

Regression models that contain only dummy explanatory variables are called analysis-of-variance (ANOVA) models, e.g. (1.1).

Model (1.1) is similar to the two-variable regression model discussed earlier, except that the explanatory variable is qualitative (Di instead of Xi). Since Di remains fixed from sample to sample (just like Xi), and assuming the ui satisfy the usual assumptions of the CLRM, the OLS method can be used to estimate the parameters of model (1.1).

Now,

Average annual food expenditure for males:

E( Yi | Di = 0 ) = B1

And, average annual food expenditure for females:

E( Yi | Di = 1 ) = B1 + B2

So, B2 measures by how much the average annual food expenditure of females differs from that of males.
Since there is no continuous regression line, it is not appropriate to call B2 a slope coefficient. B2, in this case, is called the differential intercept term.

We know that, using the OLS method,

b2 = ∑ di yi / ∑ di2

and,

b1 = Y̅ – b2 D̅

where di = Di – D̅ and yi = Yi – Y̅

* The category for which the dummy takes the value 0 is known as the Benchmark/ Base/ Reference Category. For this particular example, Male is the base category.
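These formulas can be checked numerically. The sketch below uses small, made-up expenditure data (chosen so that the group means equal the $3176 and $2670 figures used in the example below) and confirms that b1 equals the base-category (male) mean while b2 equals the female-minus-male difference in means:

```python
# Hypothetical data: Di = 0 for males, 1 for females; Yi = annual food expenditure ($)
D = [0, 0, 0, 1, 1, 1]
Y = [3000, 3200, 3328, 2600, 2700, 2710]

n = len(Y)
D_bar, Y_bar = sum(D) / n, sum(Y) / n

# OLS formulas from the notes: b2 = sum(di*yi) / sum(di^2), b1 = Ybar - b2*Dbar
b2 = sum((d - D_bar) * (y - Y_bar) for d, y in zip(D, Y)) / sum((d - D_bar) ** 2 for d in D)
b1 = Y_bar - b2 * D_bar

male_mean = sum(y for d, y in zip(D, Y) if d == 0) / D.count(0)
female_mean = sum(y for d, y in zip(D, Y) if d == 1) / D.count(1)

print(b1, b2)                               # 3176.0 -506.0
print(male_mean, female_mean - male_mean)   # same numbers: b1 is the base-group mean
```

With a single 0/1 dummy, OLS reduces exactly to a comparison of group means.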

Let the estimated regression line for model (1.1) be given by

Ŷi = 3176 – 506 Di

(SE) (233) (329)

(t-ratio) (13.6) (-1.52)

r2 = 0.1890

So, average food expenditure of males = $3176 ( Di = 0 for males )

average food expenditure of females = 3176 – 506 = $2670 ( Di = 1 for females )

However, if we test the statistical significance of B2, the null hypothesis

Ho: B2 = 0

will not be rejected. (WHY?)

This means that, statistically, there is no significant difference between the average annual food expenditure of males and females.

What if we change the base category (i.e. we swap the 0 and 1 codings)?

Suppose now,

Di = 0 for female

and Di = 1 for males


How will the regression results change?

• b1 will be different (WHY?)

• The sign of b2 will change but the absolute value will remain the same (WHY?)

• r2 and the t-ratio (for b2) will remain the same (WHY?)

Suppose, the newly estimated regression line is given by

Ŷi = 2670 + 506 Di

(SE) (233) (329)

(t-ratio) (11.4) (1.52)

r2 = 0.1890

Therefore, changing the base category does not alter the substantive regression results.
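This invariance is easy to verify numerically. The sketch below uses made-up data whose group means match the $3176/$2670 figures above; swapping the 0/1 coding makes the new intercept equal to the new base-group mean and flips the sign of b2, nothing else:

```python
def ols_dummy(D, Y):
    """OLS for Yi = b1 + b2*Di with a single 0/1 dummy (formulas from the notes)."""
    n = len(Y)
    D_bar, Y_bar = sum(D) / n, sum(Y) / n
    b2 = sum((d - D_bar) * (y - Y_bar) for d, y in zip(D, Y)) / sum((d - D_bar) ** 2 for d in D)
    return Y_bar - b2 * D_bar, b2

# Hypothetical data with group means $3176 (male) and $2670 (female)
D = [0, 0, 0, 1, 1, 1]   # 0 = male, 1 = female
Y = [3000, 3200, 3328, 2600, 2700, 2710]

b1, b2 = ols_dummy(D, Y)                     # base category: males
b1s, b2s = ols_dummy([1 - d for d in D], Y)  # base category: females

print(b1, b2)    # 3176.0 -506.0
print(b1s, b2s)  # 2670.0 506.0  (sign of b2 flips; |b2| and fitted values unchanged)
```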

What if we introduce two different dummy variables for the two categories, i.e. males and females?

Let the new regression model be given by

Yi = B1 + B2 D2i + B3 D3i + ui

Where, D2i = 0 for males

D2i = 1 for females

And, D3i = 0 for females

D3i = 1 for males

Estimating this particular regression will not be possible. (WHY?)

Because one of the assumptions of the CLRM is that there is no perfect multicollinearity. However, for this model

D2i + D3i = 1 for every observation,

which means D2 and D3 are perfectly linearly related (their sum reproduces the intercept column).


Such a situation is known as the DUMMY VARIABLE TRAP, and in it we would not be able to estimate the regression coefficients.
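The trap can be seen directly in the algebra: with an intercept column plus both dummies, the matrix X'X is singular, so (X'X)⁻¹ does not exist and the OLS normal equations have no unique solution. A minimal sketch with made-up observations:

```python
# Design matrix with an intercept column plus BOTH dummies
D2 = [0, 0, 1, 1]          # 1 = female
D3 = [1 - d for d in D2]   # 1 = male, so D2i + D3i = 1 in every row
X = [[1, d2, d3] for d2, d3 in zip(D2, D3)]

# X'X (3x3); its determinant is 0, so (X'X)^(-1) does not exist and OLS fails
XtX = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]
det = (XtX[0][0] * (XtX[1][1] * XtX[2][2] - XtX[1][2] * XtX[2][1])
       - XtX[0][1] * (XtX[1][0] * XtX[2][2] - XtX[1][2] * XtX[2][0])
       + XtX[0][2] * (XtX[1][0] * XtX[2][1] - XtX[1][1] * XtX[2][0]))
print(det)  # 0
```

Dropping either dummy (or the intercept) removes the linear dependence and makes X'X invertible again.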

GENERAL RULE: If a model has a common intercept, B1, and a qualitative variable has M categories, then introduce only M – 1 dummy variables. {For each dummy variable, the base category should remain the same.}

Question: Instead of (1.1), suppose we estimate the following regression,

Ŷi = b1’ + b2’ Di’

Where, Di’ = 0 for males

Di’ = 2 for females

Check how b2’ changes. What happens to SE(b2’) and the t-ratio?

{HINT: Di’ = 2Di and we have done similar questions for change in Xi}
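As a numerical check of the hint (again using made-up data), rescaling the dummy by 2 halves both b2 and SE(b2), leaving the t-ratio unchanged:

```python
import math

def ols_with_se(D, Y):
    """Two-variable OLS with the slope's standard error (textbook formulas)."""
    n = len(Y)
    D_bar, Y_bar = sum(D) / n, sum(Y) / n
    Sdd = sum((d - D_bar) ** 2 for d in D)
    b2 = sum((d - D_bar) * (y - Y_bar) for d, y in zip(D, Y)) / Sdd
    b1 = Y_bar - b2 * D_bar
    resid = [y - b1 - b2 * d for d, y in zip(D, Y)]
    sigma2 = sum(e * e for e in resid) / (n - 2)   # unbiased estimate of var(ui)
    se_b2 = math.sqrt(sigma2 / Sdd)
    return b2, se_b2, b2 / se_b2

# Hypothetical data; Di' = 2*Di is the rescaled dummy from the question
D = [0, 0, 0, 1, 1, 1]
Y = [3000, 3200, 3328, 2600, 2700, 2710]

b2, se, t = ols_with_se(D, Y)
b2p, sep, tp = ols_with_se([2 * d for d in D], Y)

print(round(b2p / b2, 6), round(sep / se, 6))  # 0.5 0.5 : coefficient and SE are halved
print(abs(t - tp) < 1e-9)                      # True  : the t-ratio is unchanged
```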

Although useful, ANOVA regression models are not very common in economics. In most economic research, a regression model contains a combination of qualitative and quantitative variables. Such regression models (containing both types of variables) are called analysis-of-covariance (ANCOVA) models.

Next we will extend (1.1) by adding an explanatory variable.

Let

PRF : Yi = B1 + B2 Di + B3 Xi + ui (1.2)

Where Yi = Annual expenditure on food

Xi = after tax income

and Di = 0 for males

Di = 1 for females

Suppose the estimated regression equation is

Ŷi = 1506 – 228.98 Di + 0.06 Xi

(t-ratio) (8.01) (-2.38) (9.64)

R2 = 0.9268
For (1.2),

• H0: B2 = 0 will be rejected (WHY?)

Therefore, B2 is statistically significant. This means gender has an influence on food expenditure: there is a significant difference between male and female expenditure on food.

• H0: B3 = 0 will be rejected (WHY?)

As after-tax income increases, expenditure on food increases (which makes sense).

So, B3 is also statistically significant.

• R2 has increased substantially (compared to the r2 value of model 1.1)

CONCLUSION: MODEL 1.2 seems better than MODEL 1.1

In model 1.1, we were committing a mis-specification error, i.e. omission of a relevant explanatory variable.

INTERPRETATION OF REGRESSION COEFFICIENTS:

B2 – Keeping after-tax income constant, the mean food expenditure of females is lower than that of males by $228.98.

B3 – If after-tax income increases by $1, mean food expenditure increases by $0.06, i.e. by 6 cents, keeping the influence of gender constant.

Here, B3 is the marginal propensity to consume food.

For model (1.2), we can have two different regression equations for the two categories.

Mean food expenditure regression for males:

Ŷi = 1506 + 0.06 Xi

Mean food expenditure regression for females:

Ŷi = (1506 – 228.98) + 0.06 Xi = 1277.02 + 0.06 Xi


Notice that the intercepts differ but the slope is the same for the two regression lines. Thus, we have a case of parallel regressions.
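The parallel-regressions point can be seen by evaluating the estimated equation at a few income levels; the male–female gap is $228.98 regardless of income:

```python
# Estimated ANCOVA equation from the notes: Yhat = 1506 - 228.98*Di + 0.06*Xi
def yhat(D, X):
    return 1506 - 228.98 * D + 0.06 * X

for X in (10_000, 20_000, 30_000):
    gap = yhat(0, X) - yhat(1, X)
    print(X, round(yhat(0, X), 2), round(yhat(1, X), 2), round(gap, 2))
# The male-female gap is 228.98 at every income level: parallel regressions.
```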

DUMMY INTERACTING WITH SLOPE:


Next we introduce a new term into the model, in which the dummy interacts with the slope (i.e. with Xi).

Yi = B1 + B2 Di + B3 Xi + B4 (Di Xi ) + ui (1.3)

For (1.3),

Average male food consumption expenditure:

E(Yi | Di = 0, Xi ) = B1 + B3 Xi

Average female food consumption expenditure:

E(Yi | Di = 1, Xi ) = (B1 + B2) + (B3 + B4) Xi

Here, B2 – Differential intercept term

B4 – Differential slope term / slope drifter

Notice: When we add the dummy in additive form (as in model 1.2), we look at differences in the intercepts of the two categories; when we add the dummy in multiplicative/interactive form (as in model 1.3), we look at differences in the slopes of the two categories.

Let the estimated regression equation be given by:


Ŷi = 1432 – 67.89 Di + 0.06 Xi – 0.006 (Di Xi)

(t-ratio) (5.76) (-0.193) (7.31) (-0.484)

R2 = 0.93
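Plugging these estimated coefficients into the two conditional-expectation formulas above gives the group-specific intercepts and slopes (a simple plug-in sketch):

```python
# Estimated coefficients from the fitted interaction model (1.3)
B1, B2, B3, B4 = 1432, -67.89, 0.06, -0.006

male_intercept, male_slope = B1, B3                  # Di = 0
female_intercept, female_slope = B1 + B2, B3 + B4    # Di = 1

print(male_intercept, male_slope)                          # 1432 0.06
print(round(female_intercept, 2), round(female_slope, 3))  # 1364.11 0.054
```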

For the above regression line,

• H0: B2 = 0 will not be rejected (WHY?)

Di is statistically insignificant.

• H0: B4 = 0 will not be rejected (WHY?)

(Di Xi) is also statistically insignificant.

• R2 has increased only marginally (whatever small increase there is comes merely from the addition of an explanatory variable)

CONCLUSION: Model 1.2 is better than Model 1.3. We are committing a mis-specification error in model 1.3, i.e. inclusion of an unnecessary variable.

Therefore, model 1.2 seems to be the most relevant among the three models discussed so far.

To summarize,

• H0 : B2 = 0 → checks for the same intercept

• H0 : B4 = 0 → checks for the same slope

So, we have the following four cases

1) H0 : B2 = 0 → Reject

H0 : B4 = 0 → Reject

We will get dissimilar regressions.

2) H0 : B2 = 0 → Reject

H0 : B4 = 0 → Don’t Reject

We will get parallel regressions.

3) H0 : B2 = 0 → Don’t Reject

H0 : B4 = 0 → Reject

We will get concurrent regressions.

4) H0 : B2 = 0 → Don’t Reject

H0 : B4 = 0 → Don’t Reject

We will get coincident regressions.
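The four cases can be written down as a small lookup (a sketch, where True means the corresponding H0 is rejected):

```python
# Keys are (reject H0: B2 = 0?, reject H0: B4 = 0?)
cases = {
    (True,  True):  "dissimilar regressions",
    (True,  False): "parallel regressions",
    (False, True):  "concurrent regressions",
    (False, False): "coincident regressions",
}
# In the food-expenditure example, model 1.2 rejected B2 = 0 while model 1.3
# did not reject B4 = 0, which is the parallel-regressions case:
print(cases[(True, False)])  # parallel regressions
```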
