
Dummy Variable Regression Models

-Yogita Yadav

In all the regression models discussed so far, the dependent and explanatory variables were quantitative in nature. But this may not always be the case. There are many occasions when the explanatory variables are qualitative in nature, for example gender, religion, colour, nationality, etc.
These qualitative variables are represented by Dummy Variables (also known as indicator/ categorical/ binary/ dichotomous variables).
A qualitative variable can be quantified with the help of an artificial variable taking the values 0 (zero) and 1 (one), where 0 indicates the absence of an attribute and 1 indicates the presence of that attribute.

Suppose we have a regression model where Y is dependent on one qualitative variable – ‘Gender’

Yi =B1 + B2 Di + ui (1.1)

Yi = Annual food expenditure (in $)

Di = 0, for males

Di = 1, for females

Here, Di is a dummy variable representing gender.

Regression models that contain only dummy explanatory variables are called analysis-of-variance (ANOVA) models, e.g. (1.1).

Model (1.1) is similar to the two-variable regression model discussed earlier, except that the explanatory variable is qualitative (Di instead of Xi). Since Di remains fixed from sample to sample (just like Xi), and assuming the ui satisfy the usual assumptions of the CLRM, the OLS method can be used to estimate the parameters of model (1.1).

Now,

Average annual food expenditure for males:

E( Yi | Di = 0 ) = B1

And, average annual food expenditure for females:

E( Yi | Di = 1 ) = B1 + B2

So, B2 measures by how much the average annual food expenditure of females differs from that of males.
Since there is no continuous regression line, it is not appropriate to call B2 a slope coefficient. B2, in this case, is called the differential intercept term.

We know that, using the OLS method,

b2 = ∑ di yi / ∑ di2

and,

b1 = Y̅ – b2 D̅

where di = Di – D̅ and yi = Yi – Y̅

* The category for which the dummy takes the value 0 is known as the Benchmark/ Base/ Reference Category. For this particular example, Male is the base category.
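These formulas can be checked numerically. The sketch below uses small, made-up expenditure data (chosen so that the group means equal the $3176 and $2670 figures used in the example below) and confirms that b1 equals the base-category (male) mean while b2 equals the female-minus-male difference in means:

```python
# Hypothetical data: Di = 0 for males, 1 for females; Yi = annual food expenditure ($)
D = [0, 0, 0, 1, 1, 1]
Y = [3000, 3200, 3328, 2600, 2700, 2710]

n = len(Y)
D_bar, Y_bar = sum(D) / n, sum(Y) / n

# OLS formulas from the notes: b2 = sum(di*yi) / sum(di^2), b1 = Ybar - b2*Dbar
b2 = sum((d - D_bar) * (y - Y_bar) for d, y in zip(D, Y)) / sum((d - D_bar) ** 2 for d in D)
b1 = Y_bar - b2 * D_bar

male_mean = sum(y for d, y in zip(D, Y) if d == 0) / D.count(0)
female_mean = sum(y for d, y in zip(D, Y) if d == 1) / D.count(1)

print(b1, b2)                               # 3176.0 -506.0
print(male_mean, female_mean - male_mean)   # same numbers: b1 is the base-group mean
```

With a single 0/1 dummy, OLS reduces exactly to a comparison of group means.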

Let the estimated regression line for model (1.1) be given by

Ŷi = 3176 – 506 Di

(SE) (233) (329)

(t-ratio) (13.6) (-1.52)

r2 = 0.1890

So, average food expenditure of males = $3176 ( Di = 0 for males )

average food expenditure of females = 3176 – 506 = $2670 ( Di = 1 for females )

However, if we test the statistical significance of B2, the null hypothesis

Ho: B2 = 0

will not be rejected. (WHY?)

This means that, statistically, there is no significant difference between the average annual food expenditure of males and females.

What if we change the base category (i.e. we swap the 0 and 1 codings)?

Suppose now,

Di = 0 for female

and Di = 1 for males


How will the regression results change?

• b1 will be different (WHY?)

• The sign of b2 will change but the absolute value will remain the same (WHY?)

• r2 and the t-ratio (for b2) will remain the same (WHY?)

Suppose, the newly estimated regression line is given by

Ŷi = 2670 + 506 Di

(SE) (233) (329)

(t-ratio) (11.4) (1.52)

r2 = 0.1890

Therefore, changing the base category does not alter the substantive regression results.
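This invariance is easy to verify numerically. The sketch below uses made-up data whose group means match the $3176/$2670 figures above; swapping the 0/1 coding makes the new intercept equal to the new base-group mean and flips the sign of b2, nothing else:

```python
def ols_dummy(D, Y):
    """OLS for Yi = b1 + b2*Di with a single 0/1 dummy (formulas from the notes)."""
    n = len(Y)
    D_bar, Y_bar = sum(D) / n, sum(Y) / n
    b2 = sum((d - D_bar) * (y - Y_bar) for d, y in zip(D, Y)) / sum((d - D_bar) ** 2 for d in D)
    return Y_bar - b2 * D_bar, b2

# Hypothetical data with group means $3176 (male) and $2670 (female)
D = [0, 0, 0, 1, 1, 1]   # 0 = male, 1 = female
Y = [3000, 3200, 3328, 2600, 2700, 2710]

b1, b2 = ols_dummy(D, Y)                     # base category: males
b1s, b2s = ols_dummy([1 - d for d in D], Y)  # base category: females

print(b1, b2)    # 3176.0 -506.0
print(b1s, b2s)  # 2670.0 506.0  (sign of b2 flips; |b2| and fitted values unchanged)
```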

What if we introduce two different dummy variables for the two categories, i.e. males and females?

Let the new regression model be given by

Yi = B1 + B2 D2i + B3 D3i + ui

Where, D2i = 0 for males

D2i = 1 for females

And, D3i = 0 for females

D3i = 1 for males

Estimating this particular regression will not be possible. (WHY?)

Because one of the assumptions of the CLRM is that there is no perfect multicollinearity. However, for this model

D2i + D3i = 1 for every observation,

which means D2 and D3 are perfectly linearly related (their sum reproduces the intercept column).


Such a situation is known as the DUMMY VARIABLE TRAP, and in it we would not be able to estimate the regression coefficients.
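The trap can be seen directly in the algebra: with an intercept column plus both dummies, the matrix X'X is singular, so (X'X)⁻¹ does not exist and the OLS normal equations have no unique solution. A minimal sketch with made-up observations:

```python
# Design matrix with an intercept column plus BOTH dummies
D2 = [0, 0, 1, 1]          # 1 = female
D3 = [1 - d for d in D2]   # 1 = male, so D2i + D3i = 1 in every row
X = [[1, d2, d3] for d2, d3 in zip(D2, D3)]

# X'X (3x3); its determinant is 0, so (X'X)^(-1) does not exist and OLS fails
XtX = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]
det = (XtX[0][0] * (XtX[1][1] * XtX[2][2] - XtX[1][2] * XtX[2][1])
       - XtX[0][1] * (XtX[1][0] * XtX[2][2] - XtX[1][2] * XtX[2][0])
       + XtX[0][2] * (XtX[1][0] * XtX[2][1] - XtX[1][1] * XtX[2][0]))
print(det)  # 0
```

Dropping either dummy (or the intercept) removes the linear dependence and makes X'X invertible again.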

GENERAL RULE: If a model has a common intercept, B1, and a qualitative variable has M categories, then introduce only M – 1 dummy variables. {For each dummy variable, the base category should remain the same.}

Question: Instead of (1.1), suppose we estimate the following regression,

Ŷi = b1’ + b2’ Di’

Where, Di’ = 0 for males

Di’ = 2 for females

Check how b2’ changes. What happens to SE(b2’) and the t-ratio?

{HINT: Di’ = 2Di and we have done similar questions for change in Xi}
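As a numerical check of the hint (again using made-up data), rescaling the dummy by 2 halves both b2 and SE(b2), leaving the t-ratio unchanged:

```python
import math

def ols_with_se(D, Y):
    """Two-variable OLS with the slope's standard error (textbook formulas)."""
    n = len(Y)
    D_bar, Y_bar = sum(D) / n, sum(Y) / n
    Sdd = sum((d - D_bar) ** 2 for d in D)
    b2 = sum((d - D_bar) * (y - Y_bar) for d, y in zip(D, Y)) / Sdd
    b1 = Y_bar - b2 * D_bar
    resid = [y - b1 - b2 * d for d, y in zip(D, Y)]
    sigma2 = sum(e * e for e in resid) / (n - 2)   # unbiased estimate of var(ui)
    se_b2 = math.sqrt(sigma2 / Sdd)
    return b2, se_b2, b2 / se_b2

# Hypothetical data; Di' = 2*Di is the rescaled dummy from the question
D = [0, 0, 0, 1, 1, 1]
Y = [3000, 3200, 3328, 2600, 2700, 2710]

b2, se, t = ols_with_se(D, Y)
b2p, sep, tp = ols_with_se([2 * d for d in D], Y)

print(round(b2p / b2, 6), round(sep / se, 6))  # 0.5 0.5 : coefficient and SE are halved
print(abs(t - tp) < 1e-9)                      # True  : the t-ratio is unchanged
```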

Although useful, ANOVA regression models are not very common in economics. In most economic research, a regression model contains a combination of qualitative and quantitative variables. Such regression models (containing both types of variables) are called analysis-of-covariance (ANCOVA) models.

Next we will extend (1.1) by adding an explanatory variable.

Let

PRF : Yi = B1 + B2 Di + B3 Xi + ui (1.2)

Where Yi = Annual expenditure on food

Xi = after tax income

and Di = 0 for males

Di = 1 for females

Suppose the estimated regression equation is

Ŷi = 1506 – 228.98 Di + 0.06 Xi

(t-ratio) (8.01) (-2.38) (9.64)

R2 = 0.9268
For (1.2),

• H0: B2 = 0 will be rejected (WHY?)

Therefore, B2 is statistically significant. This means gender has an influence on food expenditure: there is a significant difference between male and female expenditure on food.

• H0: B3 = 0 will be rejected (WHY?)

As after-tax income increases, expenditure on food increases (which makes sense).

So, B3 is also statistically significant.

• R2 has increased substantially (compared to the r2 value of model 1.1)

CONCLUSION: MODEL 1.2 seems better than MODEL 1.1

In model 1.1, we were committing a mis-specification error, i.e. omission of a relevant explanatory variable.

INTERPRETATION OF REGRESSION COEFFICIENTS:

B2 – Keeping after-tax income constant, the mean food expenditure of females is lower than that of males by $228.98.

B3 – If after-tax income increases by $1, mean food expenditure increases by $0.06, i.e. by 6 cents, keeping the influence of gender constant.

Here, B3 is the marginal propensity to consume food.

For model (1.2), we can have two different regression equations for the two categories.

Mean food expenditure regression for males:

Ŷi = 1506 + 0.06 Xi

Mean food expenditure regression for females:

Ŷi = (1506 – 228.98) + 0.06 Xi = 1277.02 + 0.06 Xi


Notice that the intercepts differ but the slope is the same for the two regression lines. Thus, we have a case of parallel regressions.
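The parallel-regressions point can be seen by evaluating the estimated equation at a few income levels; the male–female gap is $228.98 regardless of income:

```python
# Estimated ANCOVA equation from the notes: Yhat = 1506 - 228.98*Di + 0.06*Xi
def yhat(D, X):
    return 1506 - 228.98 * D + 0.06 * X

for X in (10_000, 20_000, 30_000):
    gap = yhat(0, X) - yhat(1, X)
    print(X, round(yhat(0, X), 2), round(yhat(1, X), 2), round(gap, 2))
# The male-female gap is 228.98 at every income level: parallel regressions.
```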

DUMMY INTERACTING WITH SLOPE:


Next we introduce a new term into the model, in which the dummy interacts with the slope (i.e. with Xi).

Yi = B1 + B2 Di + B3 Xi + B4 (Di Xi ) + ui (1.3)

For (1.3),

Average male food consumption expenditure:

E(Yi | Di = 0, Xi ) = B1 + B3 Xi

Average female food consumption expenditure:

E(Yi | Di = 1, Xi ) = (B1 + B2) + (B3 + B4) Xi

Here, B2 – Differential intercept term

B4 – Differential slope term / slope drifter

Notice: When we add the dummy in additive form (as in model 1.2), we look at differences in the intercepts of the two categories; when we add the dummy in multiplicative/interactive form (as in model 1.3), we look at differences in the slopes of the two categories.

Let the estimated regression equation be given by:


Ŷi = 1432 – 67.89 Di + 0.06 Xi – 0.006 (Di Xi)

(t-ratio) (5.76) (-0.193) (7.31) (-0.484)

R2 = 0.93
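Plugging these estimated coefficients into the two conditional-expectation formulas above gives the group-specific intercepts and slopes (a simple plug-in sketch):

```python
# Estimated coefficients from the fitted interaction model (1.3)
B1, B2, B3, B4 = 1432, -67.89, 0.06, -0.006

male_intercept, male_slope = B1, B3                  # Di = 0
female_intercept, female_slope = B1 + B2, B3 + B4    # Di = 1

print(male_intercept, male_slope)                          # 1432 0.06
print(round(female_intercept, 2), round(female_slope, 3))  # 1364.11 0.054
```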

For the above regression line,

• H0: B2 = 0 will not be rejected (WHY?)

Di is statistically insignificant.

• H0: B4 = 0 will not be rejected (WHY?)

(Di Xi) is also statistically insignificant.

• R2 has increased only marginally (whatever small increase there is comes merely from the addition of an explanatory variable)

CONCLUSION: Model 1.2 is better than Model 1.3. We are committing a mis-specification error in model 1.3, i.e. inclusion of an unnecessary variable.

Therefore, model 1.2 seems to be the most relevant among the three models discussed so far.

To summarize,

• H0 : B2 = 0 → checks for the same intercept

• H0 : B4 = 0 → checks for the same slope

So, we have the following four cases

1) H0 : B2 = 0 → Reject

H0 : B4 = 0 → Reject

We will get dissimilar regressions.

2) H0 : B2 = 0 → Reject

H0 : B4 = 0 → Don’t Reject

We will get parallel regressions.

3) H0 : B2 = 0 → Don’t Reject

H0 : B4 = 0 → Reject

We will get concurrent regressions.

4) H0 : B2 = 0 → Don’t Reject

H0 : B4 = 0 → Don’t Reject

We will get coincident regressions.
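The four cases can be written down as a small lookup (a sketch, where True means the corresponding H0 is rejected):

```python
# Keys are (reject H0: B2 = 0?, reject H0: B4 = 0?)
cases = {
    (True,  True):  "dissimilar regressions",
    (True,  False): "parallel regressions",
    (False, True):  "concurrent regressions",
    (False, False): "coincident regressions",
}
# In the food-expenditure example, model 1.2 rejected B2 = 0 while model 1.3
# did not reject B4 = 0, which is the parallel-regressions case:
print(cases[(True, False)])  # parallel regressions
```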
