
PREDICTION OF OCCURRENCE OF BAD DEBT USING

PROPERTY-RELATED VARIABLES

PROJECT REPORT

Submitted in partial fulfilment for the completion of the course in

QUANTITATIVE TECHNIQUES II
[QT2BJ21-2]

Submitted by:

ANKITA BHATTACHARYA BJ21130


DIVYA SHRIVASTAVA BJ21139
KUSHANKUR DATTA BJ21148
PAYAL DABIR BJ21157
SANIL KHEMANI BJ21166
SURAJ RAIYANI BJ21175

Under the guidance of


Prof. Pritha Guha

Group 9C, Section C


Business Management (2021 – 23)
CONTENTS
Introduction
Data Description
Exploratory Data Analysis
Regression Modelling
Multicollinearity Diagnostics
Tests of Regression Coefficients
Model Performance/Validation
Results and Interpretation
Conclusion
Future Scope
References
INTRODUCTION
For our analysis, we have chosen the pertinent problem of identifying whether a customer's account will result in bad debt for the municipality of that region. We chose a historical dataset from a few municipalities in South Africa to aid our analysis.
We tried to gauge any dependency of bad debt on the type of account held with the municipality. For example, it is common knowledge that industrial accounts and institutions such as schools and colleges are usually debt-free. In contrast, some properties, like government-owned premises and municipal land, have a poorer record of paying bills due to lower synergies and internal conflicts.
Moreover, even among individual households, the availability of a social security number lends credibility to the household; such households usually have a better record of paying bills in a timely fashion.
Lastly, people living in upscale localities, in houses with high property values, typically have no issues maintaining a good record of monthly bill payments.
Our analysis tries to determine the veracity of such assertions and to estimate the extent to which each of these factors contributes to bad debt.
Our goal is to determine the factors most strongly related to accounts turning into bad debt for the municipalities. We assess the possibility of bad debt by considering various account-associated factors such as account type, property value and size, and the total billing amount.

DATA DESCRIPTION
We have selected a data set from Kaggle on "Municipal Debt Risk Analysis" for our analysis.
The dataset is used to predict whether an account will result in bad debt. The data was collected from the billing systems of 8 municipalities in South Africa over two years.
Evaluating the dataset:
1. Who collected the data?
The author of the dataset is Dylan Rawlins, a Kaggle user.
2. What is the source of the data?
Municipal Finance Systems of 8 Municipalities in South Africa.
3. When was the dataset created?
It was created in July 2020 and is expected to be updated annually.
4. What was the purpose of the dataset?
To predict if an account is a bad debt, based on various parameters like account
category, property value, property size, total billing and so on.
5. What type of data was collected?
The account category, property details, billing details, receipting details, debt records,
total write off, collection ratio, electricity bill, debt billing ratio, and bad debt details.
A sample of the dataset is given below:

accountcategoryid 12 1 1 1 1
accountcategory Place of Worship Residential Residential Residential Residential
acccatabbr POW RES RES RES RES
propertyvalue 0 0 0 0 0
propertysize 0 0 0 0 0
totalbilling 4177 2084 4177 1862 4456
avgbilling 116 58 116 155 124
totalreceipting 16525 0 0 0 0
avgreceipting 2066 0 0 0 0
total90debt 0 25387 30269 19819 25161
totalwriteoff 0 0 0 0 0
collectionratio 3.96 0 0 0 0
debtbillingratio 0 12.18 7.25 10.64 5.65
totalelecbill 0 0 0 0 0
hasidno 0 0 0 0 0
baddebt 0 1 1 1 1
Table 1: Sample of the dataset (5 entries)

Brief description of the parameters in our analysis:


1. Account Category ID (accountcategoryid): Categorical (Numerical)
2. Account Category (accountcategory): Categorical (Character)
3. Account Category Abbreviation (acccatabbr): Categorical (Character)
4. Property Value (propertyvalue): Discrete (Numerical)
5. Property Size (propertysize): Discrete (Numerical)
6. Total Billing (totalbilling): Discrete (Numerical)
7. Average Billing (avgbilling): Discrete (Numerical)
8. Total Receipting (totalreceipting): Discrete (Numerical)
9. Average Receipting (avgreceipting): Discrete (Numerical)
10. Total Debt (total90debt): Discrete (Numerical)
11. Total Writeoff (totalwriteoff): Discrete (Numerical)
12. Collection Ratio (collectionratio): Continuous (Numerical)
13. Debt Billing Ratio (debtbillingratio): Continuous (Numerical)
14. Total Electricity Bill (totalelecbill): Discrete (Numerical)
15. Has ID Number (hasidno): Binary (Numerical)
16. Bad Debt (baddebt): Binary (Numerical)
Figure 1: Summary of the Dataset

EXPLORATORY DATA ANALYSIS


1. Correlation Matrix: We used the correlation matrix to ascertain the pairwise linear correlation between the variables in the dataset.

Variables with correlation (ρ) >= 0.7:

We came across three pairs of variables with high correlation:

1. Total Billing and Total Receipting (ρ = 0.9043): This shows that the higher the amount an account is billed, the higher the amount received by the municipal corporation, which is expected since the two variables have a linear relationship.
2. Total Billing and Total Electricity Bill (ρ = 0.8746): The electricity bill is a component of the total bill; hence, when the electricity bill increases, the total bill also increases.
3. Total Receipting and Total Electricity Bill (ρ = 0.9543): The same logic applies to the receipted amount: the higher the electricity bill, the higher the bill amount paid by customers, which contributes to the increase in the total receipted amount.

Figure 2: Correlation Matrix

No two variables are strongly negatively correlated; the negative correlation values are all close to 0.
Interesting insight: The correlation between "Bad Debt" and whether the customer has an ID number is 0.3067. This shows that whether a debt will go bad is mildly positively correlated with whether the customer has an ID number.
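As an illustration of this step, a minimal R sketch might look as follows (the data frame name mun_debt is an assumption, not necessarily the name used in the original analysis):

    # Sketch: correlation matrix of the numeric variables in the dataset.
    # "mun_debt" is an assumed name for the data frame read from the Kaggle CSV.
    num_vars <- mun_debt[sapply(mun_debt, is.numeric)]
    cor_mat <- round(cor(num_vars), 4)
    cor_mat
    # List variable pairs with |rho| >= 0.7 (the diagonal of 1s is excluded)
    which(abs(cor_mat) >= 0.7 & abs(cor_mat) < 1, arr.ind = TRUE)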

2. Bar Plot: We used bar plots to understand the relationships between the various categorical variables. Our dataset had three categorical variables:

a. Account Category ("acccatabbr") - 12 categories.
b. Whether the customer has an ID ("hasidno") - 2 categories.
c. Whether the account is a Bad Debt ("baddebt") - 2 categories.

a. Account Category ("acccatabbr") v/s Whether the account is a Bad Debt ("baddebt")

We constructed a percent-stacked bar plot to understand the percentage of each account category that has resulted in bad debt.

We can see that out of all the different types of accounts, the "Unknown" accounts (UKN) have the highest percentage of bad debts, followed by "Municipal" accounts (MUN). "Educational" (EDU) and "Industry" (IND) accounts have the lowest percentage of bad debts.

From the stacked count bar plot, we can see that the "Residential" (RES) accounts have the highest number of bad debts (~45,000).

Figure 3: Percent Stacked Bar Plot 1
Figure 4: Stacked Bar Plot 1
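A short ggplot2 sketch of how such a percent-stacked bar plot might be built (the data frame name mun_debt is assumed):

    # Sketch: percentage of bad debt within each account category.
    library(ggplot2)
    ggplot(mun_debt, aes(x = acccatabbr, fill = factor(baddebt))) +
      geom_bar(position = "fill") +   # "fill" scales each bar to 100%
      labs(x = "Account category", y = "Proportion", fill = "Bad debt")
    # position = "stack" (the default) gives the count version of the same plot.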


b. Whether the customer has a social security number ("hasidno") v/s Whether the account is a Bad Debt ("baddebt")

We can see that there are more accounts for customers without a social security number. However, the percentage of accounts with bad debt is higher for customers who do have a social security number.

Figure 5: Stacked Bar Plot 2

c. Account Category ("acccatabbr") v/s Whether the customer has a social security number ("hasidno")

We constructed a percent-stacked bar plot to understand the percentage of each account category with a social security number.

We can see that out of all the different types of accounts, the "Unknown" accounts (UKN) have the highest percentage of social security numbers, followed by "Residential" (RES) and "Municipal" (MUN) accounts. "Educational" (EDU), "Environmental Management" (ENV) and "Government" (GOV) accounts have the lowest percentage of social security numbers.

Figure 6: Percent Stacked Bar Plot 2

3. Scatter Plot

a. Analyzing the relationship of the Debt-to-Billing Ratio and the Collection Ratio with Bad Debt

We used a scatter plot to see how "baddebt" behaves when analyzed against the two ratios in our dataset: "debtbillingratio", the ratio between the debt issued and the bills generated for a particular account, and "collectionratio", the ratio of the amount of bill paid to the amount billed for any given account. Using this, we wanted to see the relationship between these ratios and the presence of defaulted debt.

"baddebt", being a categorical variable, has only two values, 0 or 1, representing the absence or presence of defaulted debt. This can be seen in the graph, where light blue shows defaulted debts and dark blue shows debts paid back on time.

● In the case of bad debt, the Collection Ratio value is constant for most of the account categories, except for one outlier.
● However, in the case of no bad debt, the Collection Ratio values vary, as shown by the dark blue points, with one outlier visible in the graph.

The reverse can be said about the debtbillingratio, which shows variation in the case of bad debt but only slight deviation in the other case.

Figure 7: Scatterplot 1

We can further analyze this variation categorically, based on account type. We can see that most of the variation for bad debt comes from the Residential account type (RES), as can be inferred from the bar graphs below. Further, the Residential account type is the most numerous, thus making up a significant chunk of the transactions.

Figure 8: Scatterplot 2
b. Scatter plot of Property Value v/s Total Debt in the Past 90 Days: We created a scatter plot to understand the relationship between the property value and the total debt in the past 90 days. We can see that the total debt in the past 90 days is inversely related to the property value, as depicted by the scatter plot. This indicates that accounts with high property values carry relatively lower debt than accounts with lower property values.

Figure 9: Scatterplot 3 - Total90Debt v/s Property Value
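A brief sketch of how this scatter plot might be drawn in ggplot2 (data frame name assumed):

    # Sketch: 90-day debt against property value, coloured by bad-debt status.
    library(ggplot2)
    ggplot(mun_debt, aes(x = propertyvalue, y = total90debt,
                         colour = factor(baddebt))) +
      geom_point(alpha = 0.4) +
      labs(x = "Property value", y = "Total 90-day debt", colour = "Bad debt")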

4. Box Plot - Analyzing the outliers based on account type

a. Property value based on Account Category ID:
The box plot of property value v/s Account Category ID points at the presence of outliers. This could be due to certain accounts not owning property and certain accounts (Government, Business) owning huge, estate-like properties.

Figure 10: Boxplot 1

b. Total debt in the past 90 days based on Account Category ID:
The box plot of the total debt in the past 90 days v/s Account Category ID points at the presence of outliers for most of the account types except 3, 8, 9 and 13 (Industry, Infrastructure, Public Benefit & Environmental Management). This can be due to the fact that industrial and infrastructure accounts usually pay off their debts within the stipulated time.

Figure 11: Boxplot 2
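In base R, such box plots can be sketched as follows (data frame name assumed):

    # Sketch: distribution of property value and 90-day debt by account category ID.
    boxplot(propertyvalue ~ accountcategoryid, data = mun_debt,
            xlab = "Account category ID", ylab = "Property value")
    boxplot(total90debt ~ accountcategoryid, data = mun_debt,
            xlab = "Account category ID", ylab = "Total 90-day debt")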

REGRESSION MODELLING
We split the dataset into training and test datasets using a 70-30 split for our analysis. We then trained our model on the training dataset and used the model to predict the values in the test dataset.
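A minimal sketch of such a 70-30 split in R (the data frame name mun_debt and the seed are assumptions):

    # Sketch: 70-30 train/test split of the dataset.
    set.seed(123)                                   # assumed seed for reproducibility
    train_idx <- sample(seq_len(nrow(mun_debt)), size = floor(0.7 * nrow(mun_debt)))
    train_data <- mun_debt[train_idx, ]
    test_data <- mun_debt[-train_idx, ]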
To build the logistic regression model, we first cleaned the dataset by analysing two major factors:
1. Missing values
We checked for missing values and found that there are none in the dataset.
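A sketch of one way this check can be done in R (data frame name assumed):

    # Sketch: count missing values per column; all zeros means no missing data.
    colSums(is.na(mun_debt))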

2. Outliers
On creating boxplots, we found several outliers in the data; simply removing them could cause a massive loss of information. The majority of the outliers were present in the propertyvalue and propertysize variables, which could be due to people not owning property; as a result, these variables had many occurrences of the value 0.
Figure 12: Outliers by variable

Model Building
During the logistic regression modelling, we analyzed the impact of the various independent variables on the dependent variable, i.e., baddebt, and attempted to predict the value of the baddebt variable. The model was built on the training dataset, and its accuracy was tested on the test dataset.

Building the basic regression model – Model_1

Assuming all the variables were significant for building the logistic model, a basic regression model was built.
Given below is the summary of the basic model_1.

Figure 13: Output of Model 1
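A minimal sketch of how such a full logistic model might be fitted in R (not the report's exact code; train_data follows the split sketched earlier):

    # Sketch: logistic regression of baddebt on all candidate predictors.
    model_1 <- glm(baddebt ~ accountcategoryid + propertyvalue + propertysize +
                     totalbilling + avgbilling + totalreceipting + avgreceipting +
                     total90debt + totalwriteoff + collectionratio +
                     debtbillingratio + totalelecbill + hasidno,
                   data = train_data, family = binomial)
    summary(model_1)   # coefficients, z-tests and AIC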


The p-values for propertyvalue, totalbilling, avgbilling, totalreceipting, avgreceipting, collectionratio and totalelecbill were very high. Hence, these variables were not significant enough to be included as predictor variables in the logistic model, so we created a revised model after dropping them.

Updated regression model – Model_2

After dropping the variables mentioned above, we arrived at this model.
Given below is the summary of the updated model_2.

Figure 14: Output of Model 2

We analyzed the summary of model_2 and observed that the p-values of all the variables were very low, which implied that all the variables were now significant in the model. However, the AIC value had increased significantly. So, we built a new model through backward elimination using the step function to achieve a lower AIC value, intending to use that model as our final prediction model on the test dataset.
Performing the step function, we arrived at the following result

Figure 15: Output of Model 2 after Step function
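A sketch of how backward elimination with the step function might be run on model_2:

    # Sketch: backward elimination on model_2, minimising AIC at each step.
    model_step <- step(model_2, direction = "backward")
    summary(model_step)   # in the report's run, debtbillingratio is dropped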

From the step function, we observed that the AIC value reduced significantly from 2177123 to 3208.32 when the debtbillingratio variable was not included as a predictor variable. Hence, we built our third model, model_3.

Updated regression model – Model_3

Dropping debtbillingratio from model_2, we arrived at this model.
Given below is the summary of the updated model_3.

Figure 16: Output of Model 3


We observed that accountcategoryid was a significant categorical variable. To leverage this variable in the model-building process, we converted accountcategoryid to a factor variable (one-hot encoding) and then used the resulting dummy variables in our modelling to achieve a better AIC value. Hence, we built model_4.

Updated regression model – Model_4

Using one-hot encoding, we arrived at this model.
Given below is the summary of the updated model_4.

Figure 17: Output of Model 4
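A sketch of how the account category might be converted to a factor so that glm builds one dummy variable per category level; the set of remaining predictors shown here follows model_3 and is an assumption, since the report's exact specification of model_4 is not reproduced:

    # Sketch: treat accountcategoryid as a categorical factor (one-hot style dummies).
    train_data$accountcategoryid <- factor(train_data$accountcategoryid)
    test_data$accountcategoryid <- factor(test_data$accountcategoryid,
                                          levels = levels(train_data$accountcategoryid))
    model_4 <- glm(baddebt ~ accountcategoryid + propertysize + total90debt +
                     totalwriteoff + hasidno,
                   data = train_data, family = binomial)
    summary(model_4)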

In this model, we observed that the AIC value had reduced to 3117.8. However, the p-values for all the encoded variables were very high, so we applied the step function to check whether we could get a better model with reduced p-values.
Given below is the final result of the step function on model_4

Figure 18: Output of Model 4 after Step Function

Updated regression model – Model_5

Based on the results of the step function applied to model_4, we arrived at this model.
Given below is the summary of the updated model_5.

Figure 19: Output of Model 5


In this model, we observed that the p-values were very low for almost all the variables, and the AIC value was the lowest at 3106.1.
Summarizing the models, we built five models with the following AIC values.

Model    Model_1    Model_2    Model_3    Model_4    Model_5
AIC      3219.7     2177123    3208.3     3117.8     3106.1
Table 2: AIC values of all the models

We decided to further analyze model_3, model_4 and model_5 for the following reasons.
Model_3 – This model had the lowest AIC value without accountcategoryid being one-hot encoded.
Model_4 – This model had all the one-hot encoded accountcategoryid variables.
Model_5 – This model had the lowest AIC value out of all the models.

MULTICOLLINEARITY DIAGNOSTICS
For the multicollinearity diagnostics, we checked the VIF values for all the variables in the three selected models.

It was observed that all the variables in model_3 had VIF values less than 5, suggesting there was little to no correlation among the variables in model_3. In model_4, some variables had high VIF values, implying that those variables were highly correlated. Model_5 had one variable with a moderately high VIF value, indicating moderate correlation.
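Such VIF values can be obtained with the vif function from the car package; a sketch, assuming model_3, model_4 and model_5 are the fitted glm objects described above:

    # Sketch: variance inflation factors for the three selected models.
    library(car)
    vif(model_3)   # values below 5 suggest little multicollinearity
    vif(model_4)
    vif(model_5)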

TESTS OF REGRESSION COEFFICIENTS


To test the models for the significance of the variables, we used two approaches:
• G-test for the overall model
• Z-test for individual variables.
For the G-test, we took the null hypothesis and alternate hypothesis as follows.
Null: All the coefficients are equal to zero, implying that no variable is significant for the
model.
Alternate: At least one coefficient is not zero, which implies at least one variable is significant
for the model.
The table below summarizes the G test results for the three models.
Model                Model_3     Model_4     Model_5
Null deviance        133869.9    133869.9    133869.9
Residual deviance    3196.3      3083.8      3088.1
Test statistic       130673.6    130786.1    130781.8
Df                   5           16          8
p-value              ≈0          ≈0          ≈0
Table 3: G-test results for Models 3, 4 and 5
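The G statistic can be computed directly from the deviances of a fitted glm object; a sketch for model_3 (the other models are analogous):

    # Sketch: G-test (likelihood-ratio test) for the overall model.
    G <- model_3$null.deviance - model_3$deviance       # 133869.9 - 3196.3
    df <- model_3$df.null - model_3$df.residual         # number of predictors
    p_value <- pchisq(G, df = df, lower.tail = FALSE)   # approximately 0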

The resulting p-values were observed to be very low, which implied that we had sufficient evidence to reject the null hypothesis. Hence, at least one variable in each of the three models was significant. To check the significance of individual variables, we looked at the Z-test results in the summary of each model.
For the Z-test, we took the null and alternate hypotheses as follows.
Null: The coefficient of the variable under consideration is equal to zero, which implies the variable is not significant.
Alternate: The coefficient of the variable under consideration is not equal to zero, which implies the variable is significant.
Analyzing the p-values for the individual variables in the models under consideration, we observed that the p-values for all the variables in model_3 and model_5 were very low. Hence, we could reject the null for all these variables, implying that they were significant. However, the p-values of some variables in model_4 were significantly high, and hence we could not reject the null for them, which implied that the model might not be appropriate.

MODEL PERFORMANCE/ VALIDATION


We built the confusion matrices and analyzed our models' sensitivity and specificity to assess model performance on both the training and test datasets. For the purpose of prediction, we used a cutoff value of 0.5.
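A sketch of how these predictions and confusion matrices might be produced for model_3 on the test data, using base R and treating bad debt (baddebt = 1) as the positive class:

    # Sketch: predicted probabilities, 0.5 cutoff, confusion matrix,
    # and sensitivity/specificity (bad debt = positive class).
    pred_prob <- predict(model_3, newdata = test_data, type = "response")
    pred_class <- ifelse(pred_prob > 0.5, 1, 0)
    conf_mat <- table(Predicted = pred_class, Actual = test_data$baddebt)
    conf_mat
    sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])   # true positive rate
    specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])   # true negative rate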
Confusion matrices for model_3 (train data and test data)

Confusion matrices for model_4 (train data and test data)

Confusion matrices for model_5 (train data and test data)

Model (dataset)   Model_3 (train)   Model_3 (test)   Model_4 (train)   Model_4 (test)   Model_5 (train)   Model_5 (test)
Sensitivity       0.99424           0.99408          0.99424           0.99408          0.99424           0.99408
Specificity       1                 0.99995          1                 1                 1                 1
Table 4: Sensitivity and Specificity for Models 3, 4 and 5

RESULTS AND INTERPRETATION


All three models gave nearly identical values of sensitivity and specificity and hence were all suitable candidates for further analysis. We recommend that model_3 be selected for further analysis for the following reasons:
• It captures all the significant variables.
• The p-values of all the variables in this model are very low, which shows that all the variables in this model are significant.
• The multicollinearity analysis shows that the variables included in model_3 have much lower correlation compared to the other models.
Hence the equation for our logistic regression model is:
\[
p = \frac{e^{z}}{1 + e^{z}}, \qquad
\text{logit}(p) = \log\!\left(\frac{p}{1-p}\right) = z
\]
where
\[
z = -5.89 + 0.06547 \cdot \text{accountcategoryid} + 0.00000001313 \cdot \text{propertysize} + 9.069 \cdot \text{total90debt} + 7.626 \cdot \text{totalwriteoff} + 0.9248 \cdot \text{hasidno}
\]

CONCLUSION
Through our analysis, we have established that variables such as Account Type, Property Size, Total Outstanding Debt in the Past 90 Days, and Availability of a Social Security Number are significant factors in determining whether an account will result in bad debt for the municipalities.
This conclusion follows our initial intuition about bad debt being correlated with the various account parameters. We can logically conclude that certain account categories (such as government and municipal accounts) have higher instances of debt going bad, which our model validates.
Our model also supports our initial assumption that the likelihood of defaulting on debt differs between people with larger properties and people with smaller properties.
Thus, our model could predict whether a customer will default with fairly high accuracy.

FUTURE SCOPE
In future work, we would like to include the following to analyse the dataset further and to try to make the model more accurate:
• Standardize the data and use the standardized data to build the model.
• Analyse the interaction effects of the independent variables.
• Build different models using different approaches such as Naive Bayes, decision trees, support vector machines, etc., and use the model with the highest accuracy.
REFERENCES
• Kaggle.com. 2022. Comprehensive EDA on R with Logistic Regression. [online] Available at: <https://www.kaggle.com/mbkinaci/comprehensive-eda-on-r-with-logistic-regression>
• Holtz, Y., 2022. Grouped, stacked and percent stacked barplot in ggplot2. [online] R-graph-gallery.com. Available at: <https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html>
• Kaggle.com. 2022. Municipal Debt Risk Analysis. [online] Available at: <https://www.kaggle.com/dmsconsultingsa/municipal-debt-risk-analysis>
• Correspondence Analysis in Archaeology. 2022. R function for binary Logistic Regression. [online] Available at: <http://cainarchaeology.weebly.com/r-function-for-binary-logistic-regression.html>
• Stack Overflow. 2022. Stack Overflow - Where Developers Learn, Share, & Build Careers. [online] Available at: <https://stackoverflow.com/>
