ch 5 2023 Econometrics for acct and finance

CHAPTER V: Categorical Dependent Variable Models

5.1 Introduction
Standard linear regression models are applied when the dependent variable is
continuous, such as asset returns, rental value of properties, saving, expenditure, output,
etc. However, there are many situations in which the dependent variable in a regression
equation represents a discrete choice and assumes only a limited number of
values. Models involving dependent variables of this kind are called categorical
(limited, discrete or qualitative) dependent variable models. In such models, the
values that the dependent variable may take are limited to certain integers (e.g. 0, 1, 2,
3, and 4) or are even binary (only 0 or 1).

Categorical dependent variable models may be used when a decision maker faces a
choice among a set of alternatives meeting the following criteria:
• The number of choices is finite
• The choices are mutually exclusive (the person chooses only one of the alternatives)
• The choices are exhaustive (all possible alternatives are included)
The first criterion is a binding one. We can always refine the available choices so that
they can satisfy the last two criteria. Throughout our discussion we shall restrict
ourselves to cases of qualitative choice where the set of alternatives is binary. For the
sake of convenience the dependent variable is given a value of 0 or 1.

Example: Suppose we want to develop a model for prediction of business failure. Here
the appropriate form for the dependent variable would be a dummy variable taking the
values 0 and 1 since there are only two possible outcomes:

$$Y_i = \begin{cases} 1 & \text{if the } i\text{th company fails} \\ 0 & \text{otherwise} \end{cases}$$

The independent variables that affect the success or failure of companies (that is,
indicators of financial status) may be the working capital to total assets ratio, the retained
earnings to total assets ratio, the earnings before interest and taxes to total assets ratio, and
the sales to total assets ratio. Thus, we would predict the probability of failure of
companies on the basis of these explanatory variables.

5.2 The linear probability model (LPM)


Here we would estimate a model where the outcomes Yi (a series of zeros and ones)
would be the dependent variable. The set of explanatory variables could include either
quantitative variables, dummies or both. This is then a standard linear regression
model:

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + \varepsilon_i \qquad (1)$$

Suppose, for example, we wanted to fit a model of default on a bank loan (failing
to pay back the loan on time) as a function of income. The dependent variable ($Y_i$) is a
dummy variable taking the values 1 and 0:

$$Y_i = \begin{cases} 1 & \text{if the } i\text{th individual defaults on a bank loan} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

The linear probability model is thus:

$$Y_i = \beta_1 + \beta_2 X_{2i} + \varepsilon_i \qquad (3)$$

where $X_{2i}$ is the income of the ith individual. The probability that the ith individual defaults
on a bank loan, $P_i = \Pr(Y_i = 1)$, is called the response probability. Similarly, the
probability that the ith individual is a non-defaulter on a bank loan is given by
$1 - P_i = \Pr(Y_i = 0)$.

From equation (3) we can see that $\varepsilon_i$ assumes only two values:

if $Y_i = 1$ then $\varepsilon_i = 1 - (\beta_1 + \beta_2 X_{2i})$ (with probability $P_i$)

if $Y_i = 0$ then $\varepsilon_i = -(\beta_1 + \beta_2 X_{2i})$ (with probability $1 - P_i$)

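Hence, the probability distribution of $\varepsilon_i$ is:

$$\begin{array}{c|c} \varepsilon_i & \text{Probability} \\ \hline 1 - (\beta_1 + \beta_2 X_{2i}) & P_i \\ -(\beta_1 + \beta_2 X_{2i}) & 1 - P_i \end{array}$$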

Recall: If X is a discrete random variable which can take on the values $x_1, x_2, \dots, x_k$
with respective probabilities $P(x_1), P(x_2), \dots, P(x_k)$ (such that
$P(x_1) + P(x_2) + \dots + P(x_k) = 1$), then the expected value (or mean) of X is defined as:

$$E(X) = x_1 P(x_1) + x_2 P(x_2) + \dots + x_k P(x_k) = \sum_{j=1}^{k} x_j P(x_j)$$

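Thus, the expectation (mean) of $\varepsilon_i$ is given by:

$$\begin{aligned} E(\varepsilon_i) &= [1 - (\beta_1 + \beta_2 X_{2i})] \times P_i + [-(\beta_1 + \beta_2 X_{2i})] \times (1 - P_i) \\ &= P_i - (\beta_1 + \beta_2 X_{2i}) P_i - (\beta_1 + \beta_2 X_{2i}) + (\beta_1 + \beta_2 X_{2i}) P_i \\ &= P_i - (\beta_1 + \beta_2 X_{2i}) \end{aligned}$$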

In the standard regression analysis, we assume that $E(\varepsilon_i) = 0$. Doing so here we have:

$$P_i - (\beta_1 + \beta_2 X_{2i}) = 0 \;\Rightarrow\; P_i = \Pr(Y_i = 1) = \beta_1 + \beta_2 X_{2i}$$

The fitted or estimated probability of defaulting for the ith individual is thus:

$$\hat{P}_i = \hat{\beta}_1 + \hat{\beta}_2 X_{2i}$$
75 Applied Econometrics for Accounting and Finance

where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the OLS estimators. The slope estimates for the linear probability
model can be interpreted as the change in the probability that the dependent variable
will be equal to one for a one-unit change in a given explanatory variable.
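
The zero-mean argument for the error term can be checked numerically. The sketch below is only an illustration: the function names and the parameter values (an intercept of 0.2, a slope of 0.01 and X = 30) are assumptions chosen for the example and do not come from the text. It builds the two-point distribution of the error under the assumption that the response probability equals the linear index, then applies the definition of the expected value of a discrete random variable.

```python
def error_distribution(beta1, beta2, x):
    """Two-point distribution of the LPM error term for a given X.

    Returns a list of (value, probability) pairs, assuming the response
    probability equals the linear index: P_i = beta1 + beta2 * x.
    """
    index = beta1 + beta2 * x
    p = index  # response probability implied by the zero-mean assumption
    return [(1.0 - index, p), (-index, 1.0 - p)]


def expected_value(distribution):
    """E(X) = sum over j of x_j * P(x_j) for a discrete random variable."""
    return sum(value * prob for value, prob in distribution)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    dist = error_distribution(beta1=0.2, beta2=0.01, x=30.0)
    print("probabilities sum to:", sum(prob for _, prob in dist))
    print("E(error) =", expected_value(dist))
```

Up to floating-point rounding, the expected value is zero for any parameter values, which is the condition that pins down the response probability as the linear index.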

Example: Suppose the fitted model relating defaults on a bank loan and income (in
thousands of birr) is given by:

$$\hat{P}_i = 0.15 - 0.0025 X_{2i}$$

This model suggests that for every birr 1,000 increase in income, the probability of
defaulting, $P_i = \Pr(Y_i = 1)$, decreases by 0.0025 (or 0.25 percentage points). For
instance, an individual whose income is birr 10,000 will have a
$0.15 - 0.0025(10) = 0.125$ (or 12.5%) probability of defaulting.

The problem with this model is that for any individual whose income is more than birr
60,000, the model-predicted probability of defaulting is negative. For instance, the
probability of defaulting of an individual whose income is birr 80,000 is:

$$0.15 - 0.0025(80) = -0.05$$

Clearly, such predictions cannot be allowed to stand since we know that the probability
of an event is always a number between 0 and 1 (inclusive), that is, probabilities can
never be negative. The LPM can also produce probabilities that are greater than one.
Thus, the use of the LPM when the dependent variable is categorical may lead to
nonsense probabilities.
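
The out-of-range problem is easy to reproduce. The following minimal sketch uses the fitted coefficients from the example above (0.15 and -0.0025, with income in thousands of birr); the function names are assumptions made for illustration only.

```python
def lpm_probability(income_thousands, intercept=0.15, slope=-0.0025):
    """Fitted LPM probability of default: P_hat = intercept + slope * income."""
    return intercept + slope * income_thousands


def is_valid_probability(p):
    """A probability must lie in the closed interval [0, 1]."""
    return 0.0 <= p <= 1.0


if __name__ == "__main__":
    # Incomes in thousands of birr: one inside the valid range, two beyond birr 60,000.
    for income in (10, 80, 100):
        p_hat = lpm_probability(income)
        print(f"income = birr {income},000: P_hat = {p_hat:.4f}, "
              f"valid probability: {is_valid_probability(p_hat)}")
```

Nothing in the linear specification prevents the fitted line from crossing zero (or one), which is why the check fails at high incomes.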

5.3 The logit model


The logit model approach overcomes the limitation of the LPM by using a function that
effectively transforms the regression model in such a way that the fitted values (that is,
the probabilities) are bounded within the (0, 1) interval. Visually, the fitted regression
model appears as an S-shape rather than a straight line. This is shown in Figure 1
below.

Figure 1: The logistic distribution function (S-shaped curve; vertical axis from 0.0 to 1.0)

The logistic distribution function (F) of a random variable z is given by:

$$F(z_i) = \frac{e^{z_i}}{1 + e^{z_i}}$$

where e is the base of the natural logarithm. For a logit model with a single explanatory
variable ($X_{2i}$), the response probability is given by:

$$P_i = P(Y_i = 1) = \frac{e^{\beta_1 + \beta_2 X_{2i}}}{1 + e^{\beta_1 + \beta_2 X_{2i}}}$$

Similarly, the non-response probability is given by:

$$P(Y_i = 0) = 1 - P_i = 1 - \frac{e^{\beta_1 + \beta_2 X_{2i}}}{1 + e^{\beta_1 + \beta_2 X_{2i}}} = \frac{1}{1 + e^{\beta_1 + \beta_2 X_{2i}}}$$

For the logit model, the ratio of the response probability to the non-response
probability:

$$\frac{P_i}{1 - P_i} = \frac{\Pr(Y_i = 1)}{\Pr(Y_i = 0)} = \left( \frac{e^{\beta_1 + \beta_2 X_{2i}}}{1 + e^{\beta_1 + \beta_2 X_{2i}}} \right) \Big/ \left( \frac{1}{1 + e^{\beta_1 + \beta_2 X_{2i}}} \right) = e^{\beta_1 + \beta_2 X_{2i}}$$

is called the odds of $Y_i = 1$ against $Y_i = 0$. The natural logarithm of the odds, called
log-odds or logit, is given by:

$$\ln\left( \frac{P_i}{1 - P_i} \right) = \beta_1 + \beta_2 X_{2i}$$

Thus, the log-odds is a linear function of the explanatory variable. The right-hand side has
the same form as a simple linear regression model, but the quantity being modelled is the
log-odds instead of the observed values of $Y_i$ (which are all zeros and ones). In a
similar fashion, the multiple logistic regression model is given by:

$$\ln\left( \frac{P_i}{1 - P_i} \right) = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki}$$

5.4 Illustration

The following EViews output is a fitted logistic regression model in which the
dependent variable is:

$$Y_i = \begin{cases} 1 & \text{if the } i\text{th individual defaults on a bank loan} \\ 0 & \text{otherwise} \end{cases}$$

The independent variables are: Debt-to-Income ratio, Income, Level of education,
Number of residents in the household, Home ownership (Rent = 1, Own = 0), Retired
(No = 1, Yes = 0), Gender (Male = 0, Female = 1). From the output we can see that all
explanatory variables except gender are significant at the 1% level.

For categorical (qualitative) explanatory variables, the category that is assigned the
value zero is the reference category. When interpreting results, all comparisons are
made with reference to this category.

Dependent Variable: DEFAULT

Variable               Coefficient    Std. Error    z-Statistic    Prob.
C                      -7.047758      0.341044      -20.66522      0.0000
DEBT_INCOME_RATIO       0.119521      0.005838       20.47317      0.0000
EDUCATION               0.090996      0.011712       7.769301      0.0000
GENDER                 -0.027850      0.073790      -0.377424      0.7059
HOMEOWN                 0.319665      0.077239       4.138649      0.0000
INCOME                 -0.002895      0.000776      -3.731612      0.0002
NO_RESIDENTS            0.163130      0.025019       6.520264      0.0000
RETIRED                 3.094739      0.271139       11.41385      0.0000

McFadden R-squared      0.166787      Mean dependent var      0.234000
S.D. dependent var      0.423415      S.E. of regression      0.385342
Akaike info criterion   0.909844      Sum squared resid       741.2531
Schwarz criterion       0.920271      Log likelihood         -2266.610
Hannan-Quinn criter.    0.913499      Deviance                4533.219
Restr. deviance         5440.646      Restr. log likelihood  -2720.323
LR statistic            907.4265      Avg. log likelihood    -0.453322
Prob(LR statistic)      0.000000

Obs with Dep=0          3830          Total obs               5000
Obs with Dep=1          1170
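
To see how the estimated coefficients translate into a predicted probability, the sketch below plugs them into the logistic function. The coefficients are copied from the output above; the borrower's characteristics are entirely hypothetical, and since the text does not state the units of INCOME or the coding of EDUCATION, the values are placeholders rather than a realistic case.

```python
import math

# Coefficients copied from the EViews output above.
COEFFICIENTS = {
    "C": -7.047758,
    "DEBT_INCOME_RATIO": 0.119521,
    "EDUCATION": 0.090996,
    "GENDER": -0.027850,
    "HOMEOWN": 0.319665,
    "INCOME": -0.002895,
    "NO_RESIDENTS": 0.163130,
    "RETIRED": 3.094739,
}


def predicted_default_probability(borrower):
    """Return P(Y = 1) = 1 / (1 + exp(-index)) for one borrower.

    `borrower` maps each explanatory variable name to its value.
    """
    index = COEFFICIENTS["C"] + sum(
        COEFFICIENTS[name] * value for name, value in borrower.items()
    )
    return 1.0 / (1.0 + math.exp(-index))


# Hypothetical borrower: every value below is an assumption for illustration.
example_borrower = {
    "DEBT_INCOME_RATIO": 20.0,
    "EDUCATION": 12.0,
    "GENDER": 1,        # Female = 1
    "HOMEOWN": 1,       # Rent = 1
    "INCOME": 50.0,
    "NO_RESIDENTS": 3,
    "RETIRED": 1,       # No (not retired) = 1
}

if __name__ == "__main__":
    p = predicted_default_probability(example_borrower)
    print(f"Predicted probability of default: {p:.4f}")
```

Unlike the LPM, the result always lies strictly between 0 and 1, whatever values are supplied.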

Is the model a good fit? To answer this we can use the Hosmer and Lemeshow test.
The null hypothesis of this test is that the model fits the data well. As can be seen from
the table below, the Chi-square test statistic is insignificant (the p-value of 0.3867 exceeds 5%).
Thus, we fail to reject the null hypothesis and conclude that there is no evidence of a poor fit.

Goodness-of-Fit Evaluation for Binary Specification
Hosmer-Lemeshow Test
H-L Statistic    8.4942    Prob. Chi-Sq(8)    0.3867
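
The reported p-value can be reproduced from the H-L statistic and its 8 degrees of freedom. The sketch below relies on a standard closed form for the upper tail of the chi-square distribution that holds only when the degrees of freedom are even; the function name is an assumption for illustration, and this is not necessarily how EViews computes the value.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(Chi2_df > x) for an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)


hl_statistic = 8.4942  # H-L statistic from the table above
p_value = chi2_sf_even_df(hl_statistic, df=8)

if __name__ == "__main__":
    print(f"Hosmer-Lemeshow p-value: {p_value:.4f}")
    print("Reject the null of good fit at 5%?", p_value < 0.05)
```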

We can also use the likelihood ratio (LR) test to assess the overall model fit. This tests
the joint null hypothesis that all slope coefficients (that is, all coefficients except the constant) are zero:

$$H_0: \beta_2 = \beta_3 = \dots = \beta_8 = 0$$
$$H_1: \text{at least one } \beta_j \neq 0, \quad j = 2, 3, \dots, 8$$

The p-value of the test, Prob(LR statistic), is less than 0.001. Thus, we reject the null
hypothesis at the one percent level of significance and conclude that the independent
variables are jointly statistically significant (that is, at least one of the explanatory variables is
significant).
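
The LR statistic and the McFadden R-squared in the output can both be recovered from the two log-likelihood values. The sketch below uses the standard definitions, LR = 2(lnL - lnL0) and McFadden R-squared = 1 - lnL/lnL0, which are consistent with the reported figures; the function names are illustrative, and the 1% critical value of 18.475 for a chi-square distribution with 7 degrees of freedom is a standard tabulated value, not something taken from the text.

```python
def lr_statistic(loglik_unrestricted, loglik_restricted):
    """Likelihood ratio statistic: 2 * (lnL_unrestricted - lnL_restricted)."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)


def mcfadden_r_squared(loglik_unrestricted, loglik_restricted):
    """McFadden pseudo R-squared: 1 - lnL_unrestricted / lnL_restricted."""
    return 1.0 - loglik_unrestricted / loglik_restricted


LOG_LIKELIHOOD = -2266.610         # "Log likelihood" in the output
RESTR_LOG_LIKELIHOOD = -2720.323   # "Restr. log likelihood" (constant only)
CHI2_CRITICAL_1PCT_DF7 = 18.475    # tabulated 1% critical value, 7 restrictions

lr = lr_statistic(LOG_LIKELIHOOD, RESTR_LOG_LIKELIHOOD)
r2 = mcfadden_r_squared(LOG_LIKELIHOOD, RESTR_LOG_LIKELIHOOD)

if __name__ == "__main__":
    print(f"LR statistic       = {lr:.4f}")
    print(f"McFadden R-squared = {r2:.6f}")
    print("Reject H0 at the 1% level?", lr > CHI2_CRITICAL_1PCT_DF7)
```

The degrees of freedom equal the number of restrictions under the null hypothesis, here the seven slope coefficients.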

The response probability $P_i = \Pr(Y_i = 1)$ refers to the probability that the ith individual
defaults on a bank loan.

• If the estimated regression coefficient for a quantitative explanatory variable is
negative, then an increase in that variable decreases the probability
of defaulting, and vice versa.
• If the coefficient of a qualitative explanatory variable is negative, then the odds
(and hence the probability) of defaulting are higher for the reference category (the category
that is assigned the value zero). On the other hand, if the coefficient is positive, then
the odds of defaulting are higher for the non-reference category as compared to
the reference category.
Interpretation of estimated regression coefficients

Debt-to-Income ratio
• The coefficient of debt-to-income ratio is positive. This implies that an increase in the
debt-to-income ratio increases the probability of defaulting, keeping all other
covariates fixed.
Household income
• The coefficient of income is negative. Thus, an increase in income decreases the
probability of defaulting, keeping all other covariates fixed.
Number of residents in the household
• Since the coefficient is positive, an increase in the number of residents in the household
leads to an increase in the probability of defaulting on a bank loan, keeping all other
covariates fixed.
Level of education
• The positive coefficient implies that an increase in the level of education of an
individual increases the probability of his/her defaulting, keeping all other covariates fixed.
Home ownership status
• The coefficient of home ownership is positive. Since the reference category is Own
(reside in own house), the odds (and hence the probability) of defaulting are higher for
the non-reference category (those who reside in a rented house). The odds ratio is
calculated as $\exp(\hat{\beta}_j) = \exp(0.319665) = 1.377$. The interpretation is that the odds
of defaulting for those who reside in rented houses are 1.377 times the odds for
those who reside in their own houses (that is, about 37.7% higher), keeping all other
covariates fixed.
Retired
• The coefficient of retired (whether the individual is retired or not) is positive, and
the odds ratio is $\exp(\hat{\beta}_j) = \exp(3.094739) = 22.081$. Since the reference category is Yes
(retired), the odds of defaulting for those who are not retired are about 22 times the odds
for those who are retired, keeping all other covariates fixed.
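
The odds ratios quoted above follow directly from exponentiating the estimated coefficients. As a minimal sketch (the dictionary and function names are assumptions for illustration; the coefficients are copied from the EViews output), the same calculation can be applied to every explanatory variable:

```python
import math

# Slope coefficients copied from the EViews output.
LOGIT_COEFFICIENTS = {
    "DEBT_INCOME_RATIO": 0.119521,
    "EDUCATION": 0.090996,
    "GENDER": -0.027850,
    "HOMEOWN": 0.319665,
    "INCOME": -0.002895,
    "NO_RESIDENTS": 0.163130,
    "RETIRED": 3.094739,
}


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in the variable: exp(coefficient)."""
    return math.exp(coefficient)


if __name__ == "__main__":
    for name, beta in LOGIT_COEFFICIENTS.items():
        ratio = odds_ratio(beta)
        direction = "raises" if ratio > 1.0 else "lowers"
        print(f"{name:18s} odds ratio = {ratio:8.3f} ({direction} the odds of default)")
```

An odds ratio above one corresponds to a positive coefficient and one below one to a negative coefficient, so the direction of each effect matches the sign-based interpretation given above.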
